Preface


This is an introduction to the fundamental concepts and techniques of numerical 
analysis and numerical methods for undergraduates, as well as for graduate en- 
gineers and applied scientists receiving their first exposure to numerical analysis. 
Applications drawn from the literature of many different fields will prepare students 
to use the techniques covered to solve a wide variety of practical problems. There is 
also sufficient mathematical detail to prepare students to embark upon an investigation
of more advanced topics, especially in PDEs. The presentation style is what
I like to call tell and show. This means that the concepts and techniques are first
developed in a clear, concise, and easy-to-read manner, and then illustrated with 
at least one fully worked example. In total, nearly 250 fully worked examples are 
presented to help the students grasp the sequence of calculations associated with a 
particular method and gain better insight into algorithm operation. 

The text is organized around mathematical problems, with each chapter de- 
voted to a single type of problem (e.g., rootfinding, numerical calculus: differentia- 
tion and integration, the matrix eigenvalue problem, and elliptic partial differential 
equations). Within each chapter the presentation begins with the simplest and 
most basic methods, progressing gradually to more advanced topics. Early chap- 
ters generally contain easier material, while later chapters proceed at increasing 
levels of difficulty and complexity. Throughout, emphasis is placed on understand- 
ing and being able to work with the key concepts of rate/order of convergence and 
stability, and assessing the accuracy of numerical results. This emphasis helps stu- 
dents develop skill in numerically verifying theoretical convergence speed. More 
importantly, the text emphasizes that it is not sufficient to obtain the correct an- 
swer from a numerical algorithm; one must also check that convergence toward the 
correct answer is happening at the correct speed. I have always felt very strongly
that a textbook must provide students with some means of checking their under- 
standing and honing their skills, some means of making the knowledge their own. 
This is invariably accomplished through the exercises. This text features more than 
1200 numbered exercises (many with multiple parts) organized into exercise sets at 
the end of each section. Each exercise set contains problems designed to provide 
students with the opportunity to practice (with paper, pencil, and calculator) the 
sequence of calculations associated with a particular method. The exercises usually
also require the verification of theoretical error bounds and/or theoretical rates of 
convergence. Additional exercises may require the derivation of a method, an exam- 
ination of conditions under which methods perform better or worse than predicted 
by theory, or extension of material presented in the section. Many exercises require 
students to code a numerical method on the computer and then use that computer 
code, and many exercises are application problems that require interpretation of 
results. 
Distinctive Features


A quick scan of the table of contents will reveal that certain topics typically found in 
a book of this nature, such as approximation (orthogonal least-squares, FFT, ratio- 
nal function approximation) and optimization, have been omitted. In place of these 
topics is an extensive coverage of material not usually found, or only briefly dis- 
cussed, in other texts. This extensive coverage includes treatment of non-Dirichlet 
boundary conditions, and artificial singularities for one-dimensional boundary value 
problems; treatment of non-Dirichlet boundary conditions, the multigrid method 
and irregular domains for elliptic partial differential equations; treatment of source 
and decay terms, non-Dirichlet boundary conditions, polar coordinates and prob- 
lems in two space dimensions for parabolic partial differential equations; and treat- 
ment of the advection and convection-diffusion equations. Why did I select such 
non-standard topics for inclusion? My primary objective in writing this text was to 
create a book that would allow students to immediately apply the numerical tech- 
niques they have learned to real-world problems. After reviewing technical journals 
and textbooks to determine the most commonly used basic numerical techniques 
and discussing topics with my colleagues and non-academic scientists/engineers,
I felt that students would benefit more from an expanded coverage of boundary 
value problems and partial differential equations than they would from a superfi- 
cial coverage of these same topics and inclusion of those topics which have been 
omitted. 


In keeping with the objective of preparing students to apply numerical techniques,
an extensive set of application problems has been compiled from the literature
of many different fields. Physics, biology, chemistry, chemical engineering,
thermodynamics, heat transfer, electrostatics, ecology, manufacturing and sociology
are among the fields represented. Each chapter opens with outlines of several
real-world problems which serve to motivate the study and to demonstrate the 
broad applicability of the class of methods which will be treated in that chapter. 
Application problems then appear throughout the chapter as both worked examples 
and exercises. An added benefit of the application problems is that they afford the 
opportunity to discuss practical issues, such as introducing nondimensional vari- 
ables, treating singularities, and manipulating problems into the form required by 
a particular method. Perhaps the most distinctive feature of this book is the min- 
imal amount of pseudocode which appears. This feature is in marked contrast to 
other introductory textbooks on numerical analysis, which tend to have a lot of 
pseudocode, and usually some Maple and/or MATLAB code fragments, too. Un- 
fortunately, it has been my experience that most students don’t use pseudocode 
properly. What is intended by an author as a teaching tool, more often than not, is 
used by the students just to expedite the completion of an assignment. Instead of 
digging through each line of code to develop a deeper understanding of how and why 
each method works, the student simply translates the code into whatever happens 
to be the language of choice. When this happens, little or no transfer of knowledge
takes place. The end result in such a case is that the presence of pseudocode hinders, 
rather than promotes, student learning. For this reason, I have chosen to include 
pseudocode only when the natural language description of an algorithm became too
cumbersome or when the pseudocode was needed to develop some other essential 
idea. Although pseudocode has largely been removed, students have most certainly 
not been left without guidance in the production of efficient, working code. Where 
appropriate, programming hints have been provided, and important implementa- 
tion details have been discussed. Then, of course, there are the worked examples. 
These provide dynamic demonstrations for each of the algorithms being developed 
and contain sufficient detail to suggest an overall structure for the implementation 
of the algorithm. Furthermore, to recognize that structure, the student will have 
to become actively involved with the details of the example. Thus, when compared 
with pseudocode, which is a static representation of an algorithm, worked examples 
are, in my opinion, the superior alternative. 


Supplements and Software 


This text is accompanied by an Instructor's Solutions Manual that can be obtained
(by instructors only) by contacting either the local Prentice Hall sales rep
or george_lobell@prenhall.com. There are also 70-plus pages of Answers to
Selected Exercises for students, found on the website 


www.pcs.cnu.edu/~bbradie/textbookanswers.html


To accommodate differing viewpoints on the pseudocode issue, implementations 
for all of the methods developed in the text can be downloaded via the Inter- 
net. Each method is available in MATLAB and C++ formats. Depending upon 
demand, Maple, Mathematica, MathCad and Fortran implementations may be 
added. Instructions for using the MATLAB functions are contained in the header 
of the corresponding m-file and are accessible through the standard MATLAB help 
(function name) facility. Each C++ function is described in the comments at the 
beginning of the code. The main page for obtaining the software is located at 
www.pcs.cnu.edu/~bbradie/textbookcode.html.


To the Student 


The best advice that I can give for working with this textbook is to be an active 
reader. This means that each time you come to a worked example, you should 
verify the results of all calculations on your own and attempt to fill in all of the 
missing details. A similar procedure should be employed for each proof that you 
read. Working in this fashion will not only hone your general mechanical and 
analytical skills, but will also significantly improve your understanding of how and 
why each numerical method works, and will stimulate the process by which you 
assimilate new knowledge and make it your own. The most common stumbling 
block encountered by numerical analysis students is difficulty in translating the 
natural language description of an algorithm into working computer code. Here is a 
scheme that you may find helpful in overcoming this problem. Start by identifying 
the inputs. The inputs should include every item that must be known for the 
code to perform its intended task. Don’t forget values that are needed to control 
the termination of an iterative process. Next, identify the outputs, which are,
of course, the values which the code is supposed to compute. Once the inputs 
and the outputs have been clearly identified, focus on the construction of a logical 
and well-defined sequence of steps that will produce the outputs from the inputs. 
The worked examples should prove extremely useful at this point. Finally, convert 
each step into the appropriate set of assignment statements, conditional/branching 
statements, loop structures and function calls. As with any new skill, the more you 
practice, the better you will become. 
CHAPTER 1


Getting Started 


AN OVERVIEW 


The diagram shown below provides a greatly oversimplified view of applied mathematics.
The starting point is almost always some real-world problem or real-world
phenomenon that needs to be studied. The axioms and postulates of the appropriate
discipline(s)—be they from the physical, natural, or social sciences—are then used 
to develop a set of assumptions and a set of equations, known as a mathematical 
model, which will be used for subsequent analysis. The type of equations that 
arise can range from simple algebraic equations to extremely complicated coupled 
systems of nonlinear partial differential equations. Once the model has been set, 
the next step is to solve the equations and interpret the results in the context of 
the original problem. If the predictions of the model are in agreement with experi- 
mental data, the model is accepted and can be used to make predictions regarding 
situations for which experimental data is unavailable. On the other hand, if the 
model fails to accurately reflect some desired aspect of the dynamical behavior of
the system, it is necessary to return to the model building phase and reexamine 
the validity of the model’s basic assumptions. 


    Real-world problem
    Real-world phenomena
            |
            v
    Mathematical model <-----------+
            |                      |
            v              Revision of model
        Solution             (if necessary)
            |                      |
            v                      |
    Interpretation of results -----+


During the solution phase, ideally, an analytical solution is obtained. Unfor- 
tunately, an analytical solution is generally available for the simplest cases only. 
The vast majority of situations require the use of approximate solution techniques. 
The objective of this text is to develop methods for determining approximate solutions
for several classes of mathematical problems that commonly arise during the
modeling of real-world phenomena. These will include such tasks as locating the 
roots of a function, determining the value of a definite integral, finding the solution 
of a two-point boundary value problem, and so on. 

When dealing with approximation methods, there is an essential separation 
into what could be referred to as the engineering side of the matter and the math- 
ematical side. There are, of course, the issues of which methods can be applied to 
which problems and what is the best way to implement a particular method (the 
engineering side); however, there are also the theoretical issues of how the methods 
work, how well the methods work and under what circumstances the methods can 
be expected to work (the mathematical side). This book will routinely address both 
sets of issues. 


An Example of the Modeling Process 


As an example, consider the motion of a simple pendulum. The rigid rod that forms 
the arm of the pendulum has length L and is assumed to be of negligible mass. If 
it is further assumed that the pendulum will undergo small amplitude oscillations 
and that energy losses due to air resistance will be negligible, then the second-order 
differential equation 

$$\ddot{\theta} + \omega^2 \theta = 0$$


provides a reasonable model. Here $\theta$ denotes the angle made by the pendulum
arm with the vertical position, dots denote differentiation with respect to time,
$\omega = \sqrt{g/L}$ is the natural frequency of the oscillations, and $g$ is the acceleration
due to gravity. Since the model equation has constant coefficients, an analytical 
solution is possible in this case. If it turns out that the amplitude of oscillations 
is not small and that air resistance cannot be neglected, a more appropriate model 
would be

$$\ddot{\theta} + b\dot{\theta} + \omega^2 \sin\theta = 0,$$


where b is a drag coefficient. The presence of the sine term makes this a nonlinear 
equation for which an analytical solution is no longer possible. Approximation 
techniques will have to be used to carry the analysis further. 

The remainder of this section presents a variety of different problems that 
require the use of some numerical approximation technique. In each case, reference 
is made to the type(s) of techniques that will be needed and to the chapter in which 
those techniques will be presented. 


Solving a Crime 


Commissioner Gordon has been found dead in his office. At 8:00 PM, the county 
coroner determined the core temperature of the corpse to be 90°F. One hour 
later, the core temperature had dropped to 85° F. Maintenance reported that the 
building’s air conditioning unit broke down at 4:00 PM. The temperature in the 
commissioner’s office was 68°F at that time. The computerized climate control 
system recorded that the office temperature rose at a rate of 1°F per hour after
the air conditioning stopped working. 

Captain Furillo believes that the infamous Doc B killed the commissioner. 
Doc B, however, claims that he has an alibi. Lois Lane was interviewing him at 
the Daily Planet Building, just across the street from the commissioner’s office. 
The receptionist at the Daily Planet Building checked Doc B into the building at 
6:35 PM, and the interview tapes confirm that Doc B was occupied from 6:40 PM
until 7:15 PM. Could Doc B have killed the commissioner?

To answer this question, we need to determine the time of death from the 
information we have at hand. We will assume the core temperature of the corpse 
was 98.6° F at the time of death and began decreasing immediately following death. 
We will further assume that the decrease in core temperature proceeded according 
to Newton's Law of Cooling. This principle states that the temperature of an object 
will change at a rate proportional to the difference between the temperature of the 
object and that of its surroundings. 

To explicitly formulate our model, let T(t) denote the core temperature of the 
corpse as a function of time, with time measured in hours. Take $t = 0$ to correspond
to 8:00 PM. Using this coordinate system, we know


$$T(0) = 90 \quad\text{and}\quad T(1) = 85. \tag{1}$$


Furthermore, the office temperature is given by $T_{\text{office}} = 72 + t$. Applying Newton’s
Law of Cooling, we obtain

$$\frac{dT}{dt} = -k(T - 72 - t), \tag{2}$$

where k is a positive constant of proportionality. To complete our analysis, we must 
first determine the solution of (2) that satisfies the conditions stated in (1). Then, 
using this solution, we must determine the time when the core temperature of the 
corpse was 98.6° F. 

Fortunately, (2) is a linear, first-order differential equation that may be solved 
by the method of integrating factors. The solution obtained in this manner is 


$$T(t) = 72 + t - \frac{1}{k} + c e^{-kt}, \tag{3}$$

where $c$ is a constant of integration. If we substitute $t = 0$ and use the fact that
$T(0) = 90$, we find

$$c = 18 + \frac{1}{k}. \tag{4}$$


To determine the value of k, substitute (4) and t = 1 into (3) and use T(1) = 85 to 
obtain 


$$73 - \frac{1}{k} + \left(18 + \frac{1}{k}\right) e^{-k} = 85. \tag{5}$$

Equation (5) cannot be solved explicitly for $k$, so we will have to settle for an
approximate solution. Once a value for $k$ has been obtained, the time of death, $t_d$,
is determined by the equation

$$72 + t_d - \frac{1}{k} + \left(18 + \frac{1}{k}\right) e^{-k t_d} = 98.6. \tag{6}$$

Note that this equation also cannot be solved explicitly for $t_d$.

Equations (5) and (6) are examples of a class of problems known as rootfind- 
ing problems. In Chapter 2, we will develop several techniques for computing 
approximate solutions to rootfinding problems. 
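
To make the rootfinding step concrete, here is a minimal Python sketch that solves equation (5) for k and then equation (6) for the time of death. It uses plain bisection, which is an assumption on our part rather than the specific method developed in Chapter 2. The function names (`bisect`, `eq5`, `eq6`) and the bracketing intervals [0.1, 1] and [-5, 0] are illustrative choices, picked because each residual changes sign across its interval.

```python
import math


def bisect(f, a, b, tol=1e-10, max_iter=200):
    """Approximate a root of f in [a, b]; f(a) and f(b) must differ in sign."""
    fa, fb = f(a), f(b)
    if fa * fb > 0:
        raise ValueError("f(a) and f(b) must have opposite signs")
    for _ in range(max_iter):
        m = 0.5 * (a + b)
        fm = f(m)
        # Stop when the midpoint is a root or the bracket is small enough.
        if fm == 0.0 or 0.5 * (b - a) < tol:
            return m
        if fa * fm < 0:
            b, fb = m, fm
        else:
            a, fa = m, fm
    return 0.5 * (a + b)


def eq5(k):
    """Residual of equation (5): T(1) - 85 as a function of k."""
    return 73.0 - 1.0 / k + (18.0 + 1.0 / k) * math.exp(-k) - 85.0


# Assumed bracket: eq5 is positive at k = 0.1 and negative at k = 1.
k = bisect(eq5, 0.1, 1.0)


def eq6(t):
    """Residual of equation (6): T(t) - 98.6 for the k found above."""
    return 72.0 + t - 1.0 / k + (18.0 + 1.0 / k) * math.exp(-k * t) - 98.6


# Assumed bracket: death occurred within the five hours before 8:00 PM.
t_d = bisect(eq6, -5.0, 0.0)

print(f"k = {k:.4f} per hour")
print(f"t_d = {t_d:.3f} hours relative to 8:00 PM")
```

A negative value of `t_d` corresponds to a time before 8:00 PM; converting it to a clock time and comparing with Doc B's alibi window answers the original question.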


Steady-State Distribution of the British Workforce 


In a study of class mobility in modern Great Britain, Goldthorpe and Llewellyn [1] 
classified workers into seven occupational classes: 


1. higher-grade professionals/administrators;
2. lower-grade professionals/administrators and higher-grade technicians;
3. routine nonmanual employees;
4. small proprietors;
5. lower-grade technicians and supervisors of manual laborers;
6. skilled manual laborers; and
7. semiskilled and unskilled manual laborers.

Using data from 10,309 men aged 20-64 living in England and Wales, the proportion
of children, $p_{ij}$, born to parents in occupational class $j$ who eventually became
members of occupational class $i$ was estimated. The collection of values obtained
by Goldthorpe and Llewellyn is summarized in the matrix


$$P = [p_{ij}] = \begin{bmatrix}
0.452 & 0.291 & 0.184 & 0.126 & 0.142 & 0.078 & 0.065 \\
0.189 & 0.231 & 0.157 & 0.114 & 0.136 & 0.088 & 0.078 \\
0.115 & 0.119 & 0.128 & 0.080 & 0.101 & 0.083 & 0.082 \\
0.077 & 0.070 & 0.078 & 0.244 & 0.077 & 0.065 & 0.066 \\
0.048 & 0.096 & 0.128 & 0.087 & 0.157 & 0.123 & 0.125 \\
0.054 & 0.106 & 0.156 & 0.144 & 0.212 & 0.304 & 0.235 \\
0.065 & 0.087 & 0.169 & 0.205 & 0.175 & 0.259 & 0.349
\end{bmatrix}$$


In the vocabulary of discrete dynamical systems, each $p_{ij}$ is called a transition
probability, and the matrix P is called a transition matrix. Note that the sum of 
the entries in each column of P is equal to 1. 

Suppose the vector $\pi_0$ denotes an initial distribution of workers across the
indicated occupational classes. Given the definition of the entries in the transition
matrix $P$, it follows that the distribution of workers in the next generation is given
by the vector

$$\pi_1 = P\pi_0.$$

The distribution of workers in subsequent generations is then given by

$$\pi_2 = P\pi_1 = P^2\pi_0,$$

and so on. If the transition matrix $P$ remains valid over time, will the vectors
$\pi_n = P^n\pi_0$ approach a fixed, steady-state distribution vector $\pi$ for any initial
vector $\pi_0$?

To answer this question, we need to identify another property of the matrix P. 
Observe that every element in P is strictly greater than zero. Any transition matrix 
for which some integer power of the matrix has all positive entries is called regular; 
hence, P is a regular transition matrix. Because P is a regular transition matrix, 
it can be shown that as $n \to \infty$ and for any initial distribution $\pi_0$, the sequence of
vectors $P^n\pi_0$ approaches the fixed vector $\pi$ that satisfies

$$P\pi = \pi \tag{7}$$

subject to the constraint that the sum of the entries in $\pi$ is equal to 1 (see Anton
and Rorres [2]).

A vector that satisfies equation (7) is called an eigenvector of the matrix $P$.
Since $P\pi$ is specifically equal to $1 \cdot \pi$, we say that $\pi$ is an eigenvector of $P$ associated
with the eigenvalue 1. We will develop techniques for approximating eigenvalues 
and eigenvectors of matrices in Chapter 4. 
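
The limiting behavior described above can be observed directly by applying P repeatedly to a starting distribution. The following Python sketch does exactly that; it is a minimal illustration under our own assumptions, not the eigenvalue technique of Chapter 4. The helper names (`mat_vec`, `steady_state`), the stopping tolerance, and the two starting vectors are hypothetical choices.

```python
# Transition matrix of Goldthorpe and Llewellyn (each column sums to 1).
P = [
    [0.452, 0.291, 0.184, 0.126, 0.142, 0.078, 0.065],
    [0.189, 0.231, 0.157, 0.114, 0.136, 0.088, 0.078],
    [0.115, 0.119, 0.128, 0.080, 0.101, 0.083, 0.082],
    [0.077, 0.070, 0.078, 0.244, 0.077, 0.065, 0.066],
    [0.048, 0.096, 0.128, 0.087, 0.157, 0.123, 0.125],
    [0.054, 0.106, 0.156, 0.144, 0.212, 0.304, 0.235],
    [0.065, 0.087, 0.169, 0.205, 0.175, 0.259, 0.349],
]


def mat_vec(A, x):
    """Return the matrix-vector product A x for a list-of-rows matrix."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def steady_state(P, pi0, tol=1e-12, max_iter=1000):
    """Iterate pi <- P pi until successive vectors agree to within tol."""
    pi = list(pi0)
    for _ in range(max_iter):
        nxt = mat_vec(P, pi)
        total = sum(nxt)
        nxt = [v / total for v in nxt]  # guard against rounding drift
        if max(abs(a - b) for a, b in zip(nxt, pi)) < tol:
            return nxt
        pi = nxt
    return pi


# Two different (assumed) initial distributions should give the same limit.
pi_uniform = steady_state(P, [1.0 / 7.0] * 7)
pi_class1 = steady_state(P, [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])

print([round(v, 4) for v in pi_uniform])
print([round(v, 4) for v in pi_class1])
```

Both printed vectors should agree, which is consistent with the claim that the limit does not depend on the initial distribution.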


Estimating a Coefficient of Friction 


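[Figure: a rope wrapped through an angle $\theta$ around a rough cylinder, with the restraining force $F_0$ applied at one end.]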


When a flexible rope is wrapped around a rough cylinder, a small “restraining”
force of magnitude $F_0$ applied at one end of the rope can withstand a force of
magnitude $F > F_0$ applied at the other end. The magnitude of the force that can
be withstood depends on the angle $\theta$ through which the rope is wrapped around
the cylinder (see figure above) and on the coefficient of friction, $\mu$, between the rope
and the cylinder. The coefficient of friction is defined by the relation

$$\frac{dF}{d\theta} = \mu F.$$

As part of a physics lab, students need to estimate the coefficient of friction
between a given rope and cylinder. For a restraining force of $F_0 = 5$ lb, they have
obtained the following data for $F$ as a function of $\theta$.


θ       0      π/2    π       3π/2    2π      5π/2    3π      7π/2     4π       9π/2     5π
F(θ)    5.00   7.83   12.27   19.22   30.10   47.15   73.86   115.70   181.24   283.90   444.71


To complete the lab, the students must be able to take this data and estimate 
$dF/d\theta$. This problem requires techniques for numerical differentiation, which are
considered in Chapter 6. 
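
As a rough illustration of what the students must do, the Python sketch below estimates dF/dθ at the interior data points and divides by F to obtain an estimate of the coefficient of friction. The central difference formula used here is an assumption on our part; Chapter 6 may develop this or other formulas. The names `central_differences` and `mu` are hypothetical.

```python
import math

# Measured forces F(theta) at theta = 0, pi/2, pi, ..., 5*pi (from the table).
F = [5.00, 7.83, 12.27, 19.22, 30.10, 47.15,
     73.86, 115.70, 181.24, 283.90, 444.71]
h = math.pi / 2.0  # spacing between consecutive angles


def central_differences(values, h):
    """Approximate the derivative at interior points: (f[i+1] - f[i-1]) / (2h)."""
    return [(values[i + 1] - values[i - 1]) / (2.0 * h)
            for i in range(1, len(values) - 1)]


dF = central_differences(F, h)

# The defining relation dF/dtheta = mu * F gives mu = (dF/dtheta) / F.
mu = [d / f for d, f in zip(dF, F[1:-1])]

for i, m in enumerate(mu, start=1):
    print(f"theta = {i}*pi/2: mu is approximately {m:.4f}")
```

Because the angles are spaced fairly far apart, the difference quotient carries a noticeable truncation error, so these values should be read as estimates rather than exact results.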


Projection Printing 


[Figure: schematic of projection printing, showing the light source, mask, photoresist film, and substrate.]


Projection printing is one of the major processes involved in the fabrication of 
semiconductor devices. A simple schematic is shown in the figure above. The light 
source is used to project a pattern from the mask onto the surface of the photoresist 
film. The pattern may, for example, indicate the locations where wires are to be 
placed or where metal contacts are to be acid etched. 

The photoresist film contains a chemical called a photoactive compound, or 
PAC. The absorption of light by the PAC causes a reaction in which the PAC breaks 
down. At the end of the exposure process, the contour lines of PAC concentration 
form a latent image in the resist film, much like a photographic negative. 

Let $I(z,t)$ denote the light intensity and $M(z,t)$ denote the normalized concentration
of PAC within the photoresist film during the exposure process. Here,
$z$ measures depth into the film, with $z = 0$ corresponding to the air-film interface,
and $t$ denotes time. Our objective is to determine $M(z, t_{\text{exp}})$, where $t_{\text{exp}}$ is the
duration of the exposure.

Assuming that the substrate absorbs the incident light and effectively eliminates
reflection, then $I$ and $M$ satisfy Dill’s equations [3], which are given by

$$\frac{\partial I}{\partial z} = -I(AM + B) \tag{8}$$

$$\frac{\partial M}{\partial t} = -IMC. \tag{9}$$


The constants A, B and C are material properties of the photoresist film whose 
values can be measured experimentally. The auxiliary conditions associated with (8) 
and (9) are 

$$M(z,0) = 1 \tag{10}$$

and

$$I(0,t) = I_0. \tag{11}$$

Following Babu and Barouch [4], we can reduce the system of partial differential 
equations (8) and (9) to an initial value problem involving an ordinary differential 
equation for M(z, texp). 

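If we divide (9) by $MC$, substitute the resulting expression for $-I$ into (8)
and then rearrange terms, we obtain

$$C\frac{\partial I}{\partial z} = \frac{AM + B}{M}\frac{\partial M}{\partial t} = \frac{\partial}{\partial t}\left[AM + B\ln M\right]. \tag{12}$$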


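Alternatively, if we divide (9) by $-M$ and differentiate with respect to $z$, we find

$$C\frac{\partial I}{\partial z} = -\frac{\partial}{\partial t}\left[\frac{1}{M}\frac{\partial M}{\partial z}\right]. \tag{13}$$

Combining (12) and (13) then yields

$$\frac{\partial}{\partial t}\left[AM + B\ln M + \frac{1}{M}\frac{\partial M}{\partial z}\right] = 0,$$

which, upon integration with respect to $t$, gives

$$AM + B\ln M + \frac{1}{M}\frac{\partial M}{\partial z} = f(z) \tag{14}$$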


for some arbitrary function $f$. Substituting $t = 0$ into (14) and applying (10), we
find $f(z) = A$. Using this expression in (14) and evaluating at $t = t_{\text{exp}}$, it follows
that

$$\frac{dM(z, t_{\text{exp}})}{dz} = M(z, t_{\text{exp}})\left[A\left(1 - M(z, t_{\text{exp}})\right) - B\ln M(z, t_{\text{exp}})\right]. \tag{15}$$

To obtain the initial condition associated with (15), first substitute $z = 0$
into (9). Using (11), this yields

$$\frac{\partial M(0,t)}{\partial t} = -I_0 C M(0,t).$$

The solution of this equation, taking into account (10), is $M(0,t) = \exp(-I_0 C t)$;
therefore,

$$M(0, t_{\text{exp}}) = \exp(-I_0 C t_{\text{exp}}). \tag{16}$$


Equations (15) and (16) form the desired initial value problem. Techniques for 
approximating the solution of initial value problems are developed in Chapter 7. 
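
To suggest how the initial value problem formed by (15) and (16) might be attacked numerically, here is a minimal Python sketch that marches the solution in depth using the classical fourth-order Runge-Kutta method. The choice of method is our assumption and is not necessarily the one emphasized in Chapter 7. All parameter values (`A`, `B`, `C`, `I0`, `T_EXP`, `DEPTH`) are hypothetical, chosen only so the code runs; they are not measured material properties.

```python
import math

# Hypothetical parameter values, for illustration only.
A = 0.55        # assumed material constant
B = 0.045       # assumed material constant
C = 0.013       # assumed material constant
I0 = 1.0        # assumed incident light intensity
T_EXP = 100.0   # assumed exposure duration
DEPTH = 1.0     # assumed film thickness


def dM_dz(M):
    """Right-hand side of equation (15)."""
    return M * (A * (1.0 - M) - B * math.log(M))


def rk4(f, y0, z_end, n):
    """Classical Runge-Kutta for the autonomous ODE y' = f(y) on [0, z_end]."""
    h = z_end / n
    y = y0
    ys = [y0]
    for _ in range(n):
        k1 = f(y)
        k2 = f(y + 0.5 * h * k1)
        k3 = f(y + 0.5 * h * k2)
        k4 = f(y + h * k3)
        y += h * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0
        ys.append(y)
    return ys


# Initial condition (16): PAC concentration at the air-film interface.
M0 = math.exp(-I0 * C * T_EXP)
profile = rk4(dM_dz, M0, DEPTH, 100)

print(f"M at z = 0:     {profile[0]:.4f}")
print(f"M at z = DEPTH: {profile[-1]:.4f}")
```

For 0 < M < 1 the right-hand side of (15) is positive, so the computed concentration should increase with depth, meaning less PAC has been destroyed deeper in the film.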


Rise in the Water Table due to the Spring Thaw 


An aquifer is a geological formation through which groundwater flows easily to 
supply a well or a spring. When studying an aquifer the primary quantity of 
interest is the water table. The water table is defined to be the location where the 
relative pressure (i.e., the pressure due to the water only) is zero and is generally
a function of both time and position. The diagram following shows an idealized 
one-dimensional aquifer and its water table. Note the presence of wells at which 
the height of the water table can be monitored. 


[Figure: idealized one-dimensional aquifer and its water table, with a monitoring well at each end.]


Suppose the monitoring wells in the following aquifer are separated by a dis- 
tance of 800 meters. During the spring thaw, the water level in each well is recorded 
on a regular basis. Using this information, we want to determine the behavior of 
the water table over time. Let h(x,t) denote the water table at a distance of x 
meters from the left monitoring well and at time t. Take t = 0 to correspond to 
the start of the spring thaw. (See figure at top of next page.) 

To model the change in h, consider the representative section of the aquifer 
shown above. The section has a length of one unit in the direction perpendicular 
to the page. Assuming that the density (mass per unit volume) of the water is 
constant, conservation of mass requires that the change in the volume of water 
within this representative section be equal to the net volume of water that flows 
into the section. The change in the volume of water within the section is given by 


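$$S\,\Delta h\,\Delta x \cdot 1, \tag{17}$$

where $S$ is the hydraulic storativity, which measures the ability of the aquifer to
absorb and release water.

[Figure: representative section of the aquifer, showing the change in the water table due to the net inflow of water.]

The net volume of water that flows into the section is

$$h q_x\,\Delta t \cdot 1 - \left(h q_x + \frac{\partial (h q_x)}{\partial x}\Delta x\right)\Delta t \cdot 1 = -\frac{\partial (h q_x)}{\partial x}\Delta x\,\Delta t \cdot 1. \tag{18}$$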


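Here, $q_x$ is the specific flow rate (volumetric flow rate per unit area) of water in the
$x$ direction. Equating (17) and (18), dividing by the product $\Delta x\,\Delta t$ and taking the
limit as $\Delta t \to 0$, we obtain

$$S\frac{\partial h}{\partial t} = -\frac{\partial (h q_x)}{\partial x}. \tag{19}$$

The specific flow rate is related to the slope of the water table by Darcy's law,
which states

$$q_x = -K\frac{\partial h}{\partial x}, \tag{20}$$

where $K$ is the hydraulic conductivity of the aquifer. Substituting (20) into (19)
yields

$$S\frac{\partial h}{\partial t} = \frac{\partial}{\partial x}\left(hK\frac{\partial h}{\partial x}\right). \tag{21}$$
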
Next, we assume that K is constant. Further, we make what is known as the 
Boussinesq approximation, which involves replacing the expression 


$$hK\frac{\partial h}{\partial x} \quad\text{by}\quad h_{\text{av}}K\frac{\partial h}{\partial x}.$$

Here, $h_{\text{av}}$ denotes the average level of the water table, the average being taken over
both time and space. With these simplifications, equation (21) becomes

$$\frac{\partial h}{\partial t} = \alpha\frac{\partial^2 h}{\partial x^2}, \tag{22}$$

where $\alpha = K h_{\text{av}}/S$ is the hydraulic diffusivity. To completely determine the water
table, we must supplement (22) with the initial condition

$$h(x, 0) = h_0(x) \tag{23}$$

and the boundary conditions

$$h(0,t) = h_L(t) \quad\text{and}\quad h(800,t) = h_R(t). \tag{24}$$

The function $h_0(x)$ specifies the water table at the start of the spring thaw, and
the functions $h_L(t)$ and $h_R(t)$ are obtained from the measurements made at the left
and right monitoring wells, respectively.

Equation (22) is an example of a parabolic partial differential equation, and 
the combination of equations (22), (23) and (24) is called an initial boundary value 
problem. We will develop techniques for approximating the solution of an initial 
boundary value problem involving a parabolic partial differential equation in Chap- 
ter 10. 
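
For a first impression of how such a problem can be approximated, the Python sketch below applies a simple explicit finite difference scheme (forward difference in time, centered difference in space) to equations (22), (23) and (24). This scheme is our own illustrative choice and may differ from the methods developed in Chapter 10. The diffusivity `ALPHA` and the functions `h0`, `h_left` and `h_right` are hypothetical stand-ins for real aquifer data and well measurements.

```python
ALPHA = 2000.0   # hypothetical hydraulic diffusivity (m^2 per day)
LENGTH = 800.0   # distance between the monitoring wells (m)


def h0(x):
    """Assumed initial water table: flat at 10 m."""
    return 10.0


def h_left(t):
    """Assumed left-well record: rises 0.05 m per day."""
    return 10.0 + 0.05 * t


def h_right(t):
    """Assumed right-well record: constant at 10 m."""
    return 10.0


def explicit_scheme(nx, dt, n_steps):
    """Advance h_t = ALPHA * h_xx with forward-time, centered-space differences."""
    dx = LENGTH / nx
    r = ALPHA * dt / dx ** 2
    if r > 0.5:
        # The explicit scheme is known to be unstable for r > 1/2.
        raise ValueError("time step too large: need ALPHA*dt/dx^2 <= 0.5")
    h = [h0(i * dx) for i in range(nx + 1)]
    for step in range(1, n_steps + 1):
        t = step * dt
        new = h[:]
        for i in range(1, nx):
            new[i] = h[i] + r * (h[i + 1] - 2.0 * h[i] + h[i - 1])
        new[0] = h_left(t)      # boundary condition at x = 0
        new[nx] = h_right(t)    # boundary condition at x = 800
        h = new
    return h


# Eight 100 m cells, one-day time steps, thirty days of thaw.
table = explicit_scheme(nx=8, dt=1.0, n_steps=30)
print([round(v, 3) for v in table])
```

The check on `r` reflects the usual stability restriction for this explicit scheme; with the assumed values, r = 0.2.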
1.1 ALGORITHMS


At the heart of numerical analysis is the concept of an algorithm. 


Definition. An ALGORITHM is a precisely defined sequence of steps for per- 
forming a specified task. 


Throughout this text, we will design and implement (either by hand or with 
the aid of a calculator or computer) algorithms for computing approximate solu- 
tions to certain classes of mathematical problems. We will also critically examine 
the performance of these algorithms. Our objectives will include, but not be lim- 
ited to, determining the conditions under which an algorithm is expected to work, 
how accurately the solution produced by an algorithm approximates the exact solution
of the underlying problem, and how various parameter values affect algorithm
performance. 

Though the algorithms we develop will vary in terms of the number of steps 
involved, overall complexity, and general objective, they will all consist of three 
basic components. The first is a list of the input parameters. These are the quan- 
tities that must be supplied for the algorithm to be able to carry out its designated 
task. The second component then states specifically what operations need to be 
performed and in what order they need to be performed. The final component 
identifies the output of the algorithm, the information that is reported back to the 
user. 


An Example from Statistics 


A common task in data analysis is the calculation of the mean, $\bar{x}$, and the standard
deviation, $s$, of a collection of data. The mean is the most commonly used value to
locate the “middle” of a data set, and the standard deviation measures the variation
of the data about the mean. Let $n$ denote the number of elements in the data set,
and let $x_i$ denote an individual element from the data set, where $i$ ranges from 1
through $n$. The formulas for the mean and standard deviation, which may be found
in any elementary statistics textbook, are

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \quad\text{and}\quad s = \sqrt{\frac{n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}{n(n-1)}}. \tag{1}$$


To construct an algorithm to compute the mean and standard deviation, we
first identify the inputs. Examining the formulas in (1), it is clear that the only
values that must be supplied are the $x_i$ and $n$. Next, we focus on the steps that
need to be performed. The key expressions in (1) are the two summations

$$\sum_{i=1}^{n} x_i \quad\text{and}\quad \sum_{i=1}^{n} x_i^2.$$

Once these values are known, the remaining calculations can be performed. Finally, 
we note that the algorithm must report both the mean and the standard deviation. 
Bringing all of this information together, we obtain the following algorithm. 


GIVEN:   an array of real numbers, x_i
         the number of elements in the array, n

STEP 1:  initialize xsum and x2sum to 0
STEP 2:  for i from 1 to n
             add x_i to xsum
             add x_i^2 to x2sum
         end
STEP 3:  calculate xbar = xsum / n
STEP 4:  calculate s = sqrt( (n(x2sum) - (xsum)^2) / (n(n - 1)) )
OUTPUT:  xbar and s


Observe that in STEP 1 the variables that will be used to accumulate the two 
important summations are initialized to zero. It is always good practice to initialize 
any variable before it is used later in the algorithm. 


EXAMPLE 1.1 The Statistics Algorithm in Action 


Let’s apply the above algorithm to calculate the mean and standard deviation of
the data set

$$1,\ 3,\ 5,\ 7,\ 9.$$

The inputs are thus

$$x_1 = 1,\quad x_2 = 3,\quad x_3 = 5,\quad x_4 = 7,\quad x_5 = 9, \quad\text{and}\quad n = 5.$$


Working sequentially through the steps of the algorithm then produces the following 
results. 


STEP 1:  xsum = 0;  x2sum = 0

STEP 2:  i = 1:  xsum = 0 + 1 = 1;     x2sum = 0 + 1^2 = 1
         i = 2:  xsum = 1 + 3 = 4;     x2sum = 1 + 3^2 = 10
         i = 3:  xsum = 4 + 5 = 9;     x2sum = 10 + 5^2 = 35
         i = 4:  xsum = 9 + 7 = 16;    x2sum = 35 + 7^2 = 84
         i = 5:  xsum = 16 + 9 = 25;   x2sum = 84 + 9^2 = 165

STEP 3:  xbar = 25/5 = 5

STEP 4:  s = sqrt( (5(165) - (25)^2) / (5 * 4) ) = 3.162

OUTPUT:  xbar = 5 and s = 3.162


Thus, for the data set consisting of the five numbers 1, 3, 5, 7, and 9, the mean is
$\bar{x} = 5$ and the standard deviation is $s = 3.162$.




An Algorithm for Approximating a Definite Integral 


As a second example, consider the trapezoidal rule. This is one of the most basic
schemes for approximating the value of the definite integral

$$\int_a^b f(x)\,dx$$

for an arbitrary integrand, $f$. For a fixed positive integer $n$, introduce the partition

$$a = x_0 < x_1 < x_2 < \cdots < x_{n-1} < x_n = b$$

over the integration interval $[a,b]$, where $x_i = a + ih$ for each $i$ and $h = (b-a)/n$.
Using this partition, the trapezoidal rule approximation is then given by

$$\int_a^b f(x)\,dx \approx \frac{h}{2}\left[f(a) + 2\sum_{i=1}^{n-1} f(x_i) + f(b)\right]. \tag{2}$$


Clearly, the input to an algorithm that computes the trapezoidal rule
approximation must include the integrand, $f$, and the limits of integration, $a$ and $b$.
Without these items, the underlying mathematical problem is not even defined.
The algorithm also requires the positive integer $n$, which defines the partition. In
terms of calculations, note that (2) contains a summation, so we will have to
initialize a variable to accumulate that value. We also need to compute $h$ so we can
later compute each $x_i$. To apply the formula in (2), we work from inside the square
brackets outward. First, calculate the summation. Next, multiply by two, and
add in the function values $f(a)$ and $f(b)$. Finally, multiply by $h/2$ and output the
result.

Here is the final algorithm. 


GIVEN: the limits of integration a and b 
the integrand f 
the number of subintervals n 


STEP 1:  compute h = (b - a)/n, and initialize sum to 0
STEP 2:  for i from 1 to n-1
             add f(a + ih) to sum
STEP 3:  multiply sum by 2
STEP 4:  add f(a) and f(b) to sum
OUTPUT:  (h/2)sum
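
As a minimal sketch, the trapezoidal rule algorithm can be written in Python as follows. The function name `trapezoid` and the choice to pass the integrand as a callable are assumptions made for illustration; the integrand 1/x on [1, 2] is used only as sample input.

```python
def trapezoid(f, a, b, n):
    """Trapezoidal rule approximation to the integral of f over [a, b] with n subintervals."""
    h = (b - a) / n            # STEP 1: spacing, and initialize the accumulator
    total = 0.0
    for i in range(1, n):      # STEP 2: interior points x_1, ..., x_{n-1}
        total += f(a + i * h)
    total *= 2                 # STEP 3: interior values carry weight 2
    total += f(a) + f(b)       # STEP 4: endpoint values carry weight 1
    return (h / 2) * total     # OUTPUT

# Sample call: integrand 1/x on [1, 2] with n = 4 subintervals
print(trapezoid(lambda x: 1.0 / x, 1.0, 2.0, 4))
```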


EXAMPLE 1.2 Trapezoidal Rule in Action


We will now demonstrate the trapezoidal rule algorithm by approximating the value
of the definite integral

$$\int_1^2 \frac{dx}{x}.$$

Matching this specific problem to the general pattern, $\int_a^b f(x)\,dx$, we see that $a = 1$,
$b = 2$ and $f(x) = 1/x$. For the last input parameter, let’s take $n = 4$.
Working sequentially through the steps yields


STEP 1:  h = (2 - 1)/4 = 1/4;  sum = 0

STEP 2:  i = 1:  sum = 0 + 1/(1 + 1/4) = 4/5
         i = 2:  sum = 4/5 + 1/(1 + 1/2) = 22/15
         i = 3:  sum = 22/15 + 1/(1 + 3/4) = 214/105

STEP 3:  sum = 2 (214/105) = 428/105

STEP 4:  sum = 428/105 + 1/1 + 1/2 = 1171/210

OUTPUT:  (1/8)(1171/210) = 1171/1680 = 0.697023809

Therefore,

$$\int_1^2 \frac{dx}{x} \approx 0.697023809.$$

The exact value of the integral is, of course, $\ln 2$, so the absolute error in the
trapezoidal rule approximation is $|\ln 2 - 0.697023809| \approx 3.877 \times 10^{-3}$.


As a prelude of things to come, let’s investigate the effect of the input
parameter $n$ on the performance of the trapezoidal rule. For several different values
of $n$, Table 1.1 displays the trapezoidal rule approximation, $T_n$, to $\int_1^2 dx/x$ and the
corresponding absolute error, $|e_n|$, where

$$e_n = T_n - \int_1^2 \frac{dx}{x}.$$

Note that on successive rows of the table, the value of $n$ doubles. Clearly, $|e_n|$
is a decreasing function of $n$. Can we now use the data to determine a specific
functional form for the relationship between $|e_n|$ and $n$?


 n    Approximation, T_n    Absolute Error, |e_n|    Error Ratio, |e_n|/|e_{n/2}|
 2    0.708333333           0.015186152
 4    0.697023810           0.003876628              0.255273883
 8    0.694121850           0.000974670              0.251422112
16    0.693391202           0.000244022              0.250363712
32    0.693208208           0.000061028              0.250092204


TABLE 1.1: Trapezoidal Rule Approximation to $\int_1^2 dx/x$ as a Function of $n$


This is where the fourth column of Table 1.1 comes into play. This column
lists the ratio $|e_n|/|e_{n/2}|$. Observe that each time $n$ is doubled, the absolute error
is reduced to roughly one-quarter of its previous value. Thus, the numerical evidence from
this problem suggests that, for the trapezoidal rule,

$$|e_n| \approx c/n^2,$$

where $c$ is independent of $n$. When we study the trapezoidal rule in more detail in
Chapter 6, we will show that this is, in fact, the correct functional relationship.
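
The error ratios in Table 1.1 can be regenerated with a short script. The sketch below is an illustration under stated assumptions: the helper names `trapezoid` and `error_table` are hypothetical, and the exact value ln 2 is taken from Python's `math.log`.

```python
import math

def trapezoid(f, a, b, n):
    """Trapezoidal rule with n subintervals, following the algorithm in the text."""
    h = (b - a) / n
    total = 0.0
    for i in range(1, n):
        total += f(a + i * h)
    return (h / 2) * (2 * total + f(a) + f(b))

def error_table(ns):
    """Return rows (n, T_n, |e_n|, |e_n|/|e_{n/2}|) for the integral of 1/x over [1, 2]."""
    exact = math.log(2.0)
    rows, previous_error = [], None
    for n in ns:
        t_n = trapezoid(lambda x: 1.0 / x, 1.0, 2.0, n)
        error = abs(t_n - exact)
        ratio = None if previous_error is None else error / previous_error
        rows.append((n, t_n, error, ratio))
        previous_error = error
    return rows

for n, t_n, error, ratio in error_table([2, 4, 8, 16, 32]):
    ratio_text = "" if ratio is None else f"{ratio:.9f}"
    print(f"{n:3d}  {t_n:.9f}  {error:.9f}  {ratio_text}")
```

Ratios that settle near 0.25 each time n doubles are what an error of the form c/n^2 predicts.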


Approximating a Square Root 


Many of the algorithms developed in later chapters will work in much the same way 
as the trapezoidal rule. They will compute a single approximation to the solution 
of a particular mathematical problem. In contrast, other algorithms will generate a 
sequence of approximations which hopefully converge toward the desired solution. 
Algorithms of this type are called iterative.

When constructing an iterative algorithm, there are a few important details
to keep in mind. First, every iterative algorithm must contain a stopping condition. 
A stopping condition is just a test that is used to decide when to terminate the 
iterative process and to accept the most recently computed term in the sequence 
as a final approximation. Second, to accommodate those times when the sequence 
either does not converge or converges very slowly, it is advisable to impose an 
upper limit on the number of iterations that will be allowed. Third, it is generally 
not necessary to save every term from the sequence. Typically, an algorithm can 
perform its task with knowledge of only a few of the terms. 

To illustrate these points, consider the following. Let $a$ be a nonnegative real
number. For any positive real number $x_0$, the sequence generated by the rule

$$x_{n+1} = \frac{1}{2}\left(x_n + \frac{a}{x_n}\right) \tag{3}$$

for $n = 0, 1, 2, \ldots$ converges to $\sqrt{a}$. In the next section, we will investigate this
sequence in more detail and establish that the quantity $|x_{n+1} - x_n|$ provides an
estimate for the difference $|x_{n+1} - \sqrt{a}|$. An appropriate stopping condition would
therefore be to terminate iteration as soon as $|x_{n+1} - x_n|$ falls below an input
parameter known as the convergence tolerance, which we shall denote by $\epsilon$. To
prevent the algorithm from getting caught in an infinite loop, we limit the number
of iterations to Nmax, which is another input parameter. Finally, note that during
any iteration, we have to know the value of $x_n$ in order to compute $x_{n+1}$. Once $x_{n+1}$
has been calculated, we still need to have $x_n$ available so that we may determine
whether the stopping condition has been met. If it is determined that another
iteration needs to be performed, however, we no longer need $x_n$. Accordingly, only
two terms from the sequence need to be saved at any given time.


Bringing all of this information together, we arrive at the algorithm 


GIVEN:   nonnegative real number a
         starting approximation x0
         convergence parameter ε
         maximum number of iterations Nmax


STEP 1:  for iter from 1 to Nmax
STEP 2:      compute x1 = (x0 + a/x0)/2
STEP 3:      if |x1 - x0| < ε, OUTPUT x1
STEP 4:      copy the value of x1 to x0
         end
OUTPUT:  “maximum number of iterations has been exceeded”


The variable x0 holds the value of $x_n$, while x1 holds the value of $x_{n+1}$. STEP 4 is
required so that x0 contains the correct value for the next iteration.
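
A Python sketch of the square root algorithm might look as follows. The function name `sqrt_iter` and the decision to signal the iteration limit with an exception are assumptions made for this illustration; the text itself only specifies that a message be output.

```python
def sqrt_iter(a, x0, eps, nmax):
    """Approximate sqrt(a) by iterating x1 = (x0 + a/x0)/2 from the starting value x0."""
    for _ in range(nmax):          # STEP 1: at most nmax iterations
        x1 = (x0 + a / x0) / 2     # STEP 2: next term of the sequence
        if abs(x1 - x0) < eps:     # STEP 3: stopping condition
            return x1
        x0 = x1                    # STEP 4: only the latest term is kept
    raise RuntimeError("maximum number of iterations has been exceeded")

# Sample call with illustrative inputs
print(sqrt_iter(2.0, 2.0, 0.005, 10))
```

Only two variables, `x0` and `x1`, are needed because each new term depends solely on its predecessor.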




EXAMPLE 1.3 Square Root Algorithm in Action 


Suppose we wish to approximate the value of $\sqrt{2}$. Since the algorithm we just
constructed generates an approximation to $\sqrt{a}$, this fixes $a = 2$. For the remaining
input parameters, let’s use $x_0 = 2$, $\epsilon = 0.005$ and Nmax = 10. With the input
parameters set, the first iteration of the algorithm proceeds as follows:


STEP 1:  iter = 1
STEP 2:  x1 = (2 + 2/2)/2 = 3/2
STEP 3:  |x1 - x0| = 1/2 > 0.005, so
STEP 4:  x0 = 3/2.


In the second iteration, we find


STEP 1:  iter = 2
STEP 2:  x1 = (3/2 + 4/3)/2 = 17/12
STEP 3:  |x1 - x0| = 1/12 > 0.005, so
STEP 4:  x0 = 17/12.


The third iteration then yields


STEP 1:  iter = 3
STEP 2:  x1 = (17/12 + 24/17)/2 = 577/408
STEP 3:  |x1 - x0| = 1/408 = 0.00245 < 0.005, so

OUTPUT x1 = 577/408 = 1.414215686.


Thus $\sqrt{2} \approx 1.414215686$, which is in error by roughly $2.124 \times 10^{-6}$.


EXERCISES 


1. Use the statistics algorithm from the text to compute the mean, $\bar{x}$, and the
   standard deviation, $s$, of the data set: -5, -3, 2, -2, 1.
2. With n = 4, use the trapezoidal rule algorithm from the text to approximate the 

value of the definite integral 

1 
1 
ae 
9 l+2 


3. Use the square root algorithm from the text to approximate $\sqrt{5}$. Take $x_0 = 5$,
   $\epsilon = 5 \times 10^{-4}$ and Nmax = 10.

4. A different scheme for approximating the square root of a positive real number $a$
   is based on the recursive formula

   $$x_{n+1} = \frac{x_n^3 + 3x_n a}{3x_n^2 + a}.$$

   (a) Construct an algorithm for approximating the square root of a given positive
       real number $a$ using this formula.
   (b) Test your algorithm using $a = 2$ and $x_0 = 2$. Allow a maximum of 10
iterations and use a convergence tolerance of e = 5 x 107°. Compare the 


performance of this algorithm with the one presented in the text.


5. Let $A$ be an $n \times m$ matrix and $B$ be an $m \times p$ matrix. The $n \times p$ matrix $C = AB$
   has elements defined by

   $$c_{ik} = \sum_{j=1}^{m} a_{ij} b_{jk}$$

   for each $i = 1, 2, 3, \ldots, n$ and $k = 1, 2, 3, \ldots, p$. Construct an algorithm to
   compute the product of two matrices.


6. Consider the computation of the following sum,

   $$\sum_{i=1}^{n} \sum_{j=1}^{i} a_i b_j,$$

   where the $a_i$ and $b_j$ are real numbers.

   (a) How many multiplications and how many additions are required to compute
       the sum? Each answer should be a function of $n$.

   (b) Modify the summation to an equivalent form that reduces the number of
       operations needed. How many multiplications and how many additions are
       required to compute the sum in its revised form?


7. Let $a$ be a nonzero real number. For any $x_0$ satisfying $0 < x_0 < 2/a$, the
   recursive sequence defined by

   $$x_{n+1} = x_n(2 - a x_n)$$

   converges to $1/a$.

   (a) Construct an algorithm for approximating the reciprocal of a given nonzero
       real number $a$ using this formula.

   (b) Test your algorithm using $a = 37$ and $x_0 = 0.01$. Allow a maximum of 10
       iterations and use a convergence tolerance of $\epsilon = 5 \times 10^{-4}$.


8. Given two positive integers $a$ and $b$, the greatest common divisor of $a$ and $b$ is the
   largest integer which divides both $a$ and $b$ (i.e., the largest integer $n$ for which
   both $a/n$ and $b/n$ are integers).

   (a) Construct an algorithm to compute the greatest common divisor of two
       positive integers.

   (b) How many divisions does your algorithm require?


9. The inner product, or dot product, of two $n$-vectors $\mathbf{x}$ and $\mathbf{y}$ is given by

   $$\mathbf{x} \cdot \mathbf{y} = x_1 y_1 + x_2 y_2 + x_3 y_3 + \cdots + x_n y_n.$$

   (a) Construct an algorithm to compute the inner product of two $n$-vectors.
   (b) Apply your algorithm to calculate the inner product of the vectors
       $\mathbf{x} = [-3 \;\; 4 \;\; 1 \;\; 2]^T$ and $\mathbf{y} = [1 \;\; -3 \;\; 2 \;\; 5]^T$.

10. The linear correlation coefficient for $n$ ordered pairs $(x_i, y_i)$ is given by the formula

    $$r = \frac{n\sum x_i y_i - \left(\sum x_i\right)\left(\sum y_i\right)}{\sqrt{n\sum x_i^2 - \left(\sum x_i\right)^2}\,\sqrt{n\sum y_i^2 - \left(\sum y_i\right)^2}}.$$

    (a) Construct an algorithm to compute the linear correlation coefficient for a
        given set of ordered pairs.

    (b) Apply your algorithm to compute the linear correlation coefficient for the
        following set of ordered pairs:

        x_i     3    7    9    2    7    0    3
        y_i    -5   10   15   -8   11  -10   -4


11. The midpoint rule approximates the value of a definite integral using the formula

    $$\int_a^b f(x)\,dx \approx 2h \sum_{j=1}^{n} f(x_j),$$

    where $h = (b-a)/(2n)$ and $x_j = a + (2j-1)h$.

    (a) Construct an algorithm to approximate the value of a definite integral using
        the midpoint rule.

    (b) Apply your algorithm to approximate the value of $\int_1^2 dx/x$. Take $n = 4$.
        Compare the approximation obtained from the midpoint rule with the
        approximation obtained from the trapezoidal rule.


12. Consider the following algorithm for the trapezoidal rule:

    GIVEN:   the limits of integration a and b
             the integrand f
             the number of subintervals n

    STEP 1:  compute h = (b - a)/n, and initialize sum to 0
    STEP 2:  for i from 1 to n-1
                 add 2f(a + ih) to sum
    STEP 3:  add f(a) and f(b) to sum
    OUTPUT:  (h/2)sum

    Compare the number of arithmetic operations required by this algorithm to the
    number of operations required by the algorithm presented in the text.

13. Rewrite the algorithm for the trapezoidal rule that was presented in the text to
    reduce both the number of additions and the number of multiplications/divisions
    by one.


14. Let $P(x) = a_n x^n + a_{n-1} x^{n-1} + \cdots + a_1 x + a_0$ be an $n$th-degree polynomial with
    all real coefficients, and let $x_0$ be a given real number.

    (a) Treating integer powers as repeated multiplication, how many multiplications
        and how many additions are required to evaluate $P(x)$ at $x = x_0$?

    (b) Devise an algorithm for computing the value of an $n$th-degree polynomial
        that reduces the required number of arithmetic operations. How many
        multiplications and how many additions are required by your algorithm?

For Exercises 15-18, make use of the fact that when the sum of a convergent alternating
series is approximated using the sum of the first $n$ terms, the error in this approximation
is smaller than the magnitude of the $(n+1)$st term; that is, if $\sum (-1)^n a_n$ is an
alternating series with sum $S$, then the sum of the first $n$ terms differs from $S$ by less
than the magnitude of the $(n+1)$st term.


15. The value of $\pi$ is given by

    $$\pi = 4\sum_{n=0}^{\infty} \frac{(-1)^n}{2n+1} = 4\left(1 - \frac{1}{3} + \frac{1}{5} - \frac{1}{7} + \cdots\right).$$

    (a) Construct an algorithm to approximate the value of $\pi$ to within a specified
        tolerance, $\epsilon$.


(b) Test your algorithm with a tolerance value of e = 5 x 107°. 
16. The value of $1/e$ is given by

    $$1/e = \sum_{n=0}^{\infty} \frac{(-1)^n}{n!} = \frac{1}{2!} - \frac{1}{3!} + \frac{1}{4!} - \cdots.$$

    (a) Construct an algorithm to approximate the value of $1/e$ to within a specified
        tolerance, $\epsilon$.
    (b) Test your algorithm with a tolerance value of $\epsilon = 5 \times 10^{-7}$.
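17. The value of $\sin(\pi/10)$ is given by

    $$\sin\left(\frac{\pi}{10}\right) = \sum_{n=0}^{\infty} \frac{(-1)^n}{(2n+1)!}\left(\frac{\pi}{10}\right)^{2n+1} = \frac{\pi}{10} - \frac{1}{3!}\left(\frac{\pi}{10}\right)^3 + \frac{1}{5!}\left(\frac{\pi}{10}\right)^5 - \frac{1}{7!}\left(\frac{\pi}{10}\right)^7 + \cdots.$$

    (a) Construct an algorithm to approximate the value of $\sin(\pi/10)$ to within a
        specified tolerance, $\epsilon$.

    (b) Test your algorithm with a tolerance value of $\epsilon = 5 \times 10^{-7}$. Note that the
        exact value of $\sin(\pi/10)$ is $\frac{1}{4}(\sqrt{5} - 1)$.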
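18. The value of $\cos(\pi/5)$ is given by

    $$\cos\left(\frac{\pi}{5}\right) = \sum_{n=0}^{\infty} \frac{(-1)^n}{(2n)!}\left(\frac{\pi}{5}\right)^{2n} = 1 - \frac{1}{2!}\left(\frac{\pi}{5}\right)^2 + \frac{1}{4!}\left(\frac{\pi}{5}\right)^4 - \frac{1}{6!}\left(\frac{\pi}{5}\right)^6 + \cdots.$$

    (a) Construct an algorithm to approximate the value of $\cos(\pi/5)$ to within a
        specified tolerance, $\epsilon$.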

    (b) Test your algorithm with a tolerance value of $\epsilon = 5 \times 10^{-7}$. Note that the
        exact value of $\cos(\pi/5)$ is $\frac{1}{4}(\sqrt{5} + 1)$.


1.2 CONVERGENCE


Many of the algorithms that will be developed in later chapters will be iterative in 
nature. These algorithms will generate a sequence of approximations that converge 
toward the desired solution. When several techniques are available for solving a 
particular problem, we would generally like to choose a technique whose sequence 
converges as rapidly as possible. To facilitate a comparison between competing 
methods, we will introduce in this section two quantitative measures of convergence 
speed. 


Rate of Convergence 
For completeness, remember that convergence of a sequence is defined as follows.

Definition. The sequence $\{x_n\}$ CONVERGES to the value $L$ provided

$$\lim_{n\to\infty} x_n = L,$$

or, equivalently,

$$\lim_{n\to\infty} |x_n - L| = 0.$$

The value to which the sequence converges, $L$, is called the limit of the
sequence. A sequence for which $\lim_{n\to\infty} x_n$ does not exist is said to diverge.

The two principal measures of convergence speed are known as rate of con- 
vergence and order of convergence. Let’s consider rate of convergence first. 


Definition. Let $\{p_n\}$ be a sequence that converges to a number $p$. If there
exists a sequence $\{\beta_n\}$ that converges to zero and a positive constant $\lambda$,
independent of $n$, such that

$$|p_n - p| \le \lambda |\beta_n|$$

for all sufficiently large values of $n$, then $\{p_n\}$ is said to converge to $p$ with
RATE OF CONVERGENCE $O(\beta_n)$.


The expression $O(\beta_n)$ is read “big-O of $\beta_n$” and is referred to as big-O
notation. When $\{p_n\}$ converges to $p$ with rate of convergence $O(\beta_n)$, it is common
to express this in shorthand by writing $p_n = p + O(\beta_n)$; hence, the big-O term
provides a reference for how quickly the error approaches zero.

The sequence $\{\beta_n\}$, which is typically taken to be of the form $1/n^a$ or $1/a^n$ for
some positive constant $a$ (with $a > 1$ in the second form), serves as a benchmark and
allows for ease of comparison between different sequences. For example, a sequence
with rate of convergence $O(1/n^2)$ converges more slowly than one with a rate of
convergence $O(1/n^{10})$, which in turn converges more slowly than a sequence with
rate of convergence $O(1/2^n)$.


EXAMPLE 1.4 Comparing Rates of Convergence 


Consider the sequences $\{(n+3)/(n+7)\}$ and $\{(2^n+3)/(2^n+7)\}$, whose first nine terms are listed in Table 1.2.

 n    (n+3)/(n+7)     (2^n+3)/(2^n+7)
 1    0.5000000000    0.5555555556
 2    0.5555555556    0.6363636364
 3    0.6000000000    0.7333333333
 4    0.6363636364    0.8260869565
 5    0.6666666667    0.8974358974
 6    0.6923076923    0.9436619718
 7    0.7142857143    0.9703703704
 8    0.7333333333    0.9847908745
 9    0.7500000000    0.9922928709


TABLE 1.2: Corresponding Terms in Two Sequences that Converge to 1


Since

$$\lim_{n\to\infty} \frac{n+3}{n+7} = \lim_{n\to\infty} \frac{2^n+3}{2^n+7} = 1,$$

it follows that both sequences converge to the limit 1. Although these two sequences
have the same limit value, as seen in Table 1.2, the terms in the sequence
$(2^n+3)/(2^n+7)$ appear to be approaching 1 much more rapidly than the terms in the
sequence $(n+3)/(n+7)$.
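
The entries of Table 1.2 are easy to regenerate, and printing the distance of each term from the limit makes the difference in speed more visible. The following is a minimal sketch; the function names `seq_a` and `seq_b` are illustrative labels, not notation from the text.

```python
def seq_a(n):
    """Term n of the sequence (n + 3)/(n + 7)."""
    return (n + 3) / (n + 7)

def seq_b(n):
    """Term n of the sequence (2^n + 3)/(2^n + 7)."""
    return (2**n + 3) / (2**n + 7)

# Columns: n, both terms, and the distance of each term from the limit 1
for n in range(1, 10):
    a_n, b_n = seq_a(n), seq_b(n)
    print(f"{n:2d}  {a_n:.10f}  {b_n:.10f}  {1 - a_n:.3e}  {1 - b_n:.3e}")
```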

We now determine the rate of convergence of each sequence. After some basic
algebra, we find

$$\left|\frac{n+3}{n+7} - 1\right| = \frac{4}{n+7} < \frac{4}{n}.$$

Hence, we may take $\lambda = 4$ and $\beta_n = 1/n$ in the definition of rate of convergence.
It follows that the sequence

$$\left\{\frac{n+3}{n+7}\right\}$$

converges to 1 with rate of convergence $O(1/n)$. Working in a similar manner, we
find that

$$\left|\frac{2^n+3}{2^n+7} - 1\right| = \frac{4}{2^n+7} < 4 \cdot \frac{1}{2^n}.$$

Hence, we may take $\lambda = 4$ and $\beta_n = 1/2^n$ in the definition of rate of convergence,
so the sequence

$$\left\{\frac{2^n+3}{2^n+7}\right\}$$

converges to 1 with rate of convergence $O(1/2^n)$. These results confirm our
numerical evidence since $1/2^n$ approaches zero faster than $1/n$ as $n \to \infty$.

In later chapters, many of the methods that will be developed will have
theoretical error bounds that are expressed as a function of a method parameter. For
example, most of the numerical integration techniques covered in Chapter 6 will
have error bounds expressed in terms of the parameter $h$, the spacing between the
points at which the integrand is sampled. To facilitate the comparison between
different techniques, it will be useful to have big-O notation defined for functions.


Definition. Let $f$ be a function defined on the interval $(a,b)$ that contains
$x = 0$, and suppose $\lim_{x\to 0} f(x) = L$. If there exists a function $g$ for which
$\lim_{x\to 0} g(x) = 0$ and a positive constant $K$ such that

$$|f(x) - L| \le K|g(x)|$$

for all sufficiently small values of $x$, then $f(x)$ is said to converge to $L$ with
RATE OF CONVERGENCE $O(g(x))$.


In these instances, the benchmark function $g(x)$ will tend to be of the form $x^a$
for some positive exponent $a$. An error term with rate of convergence $O(x)$ then
approaches zero more slowly than an error term with rate of convergence $O(x^2)$,
say, as the value of $x$ approaches zero.


EXAMPLE 1.5 Determining Rate of Convergence for a Function 


Consider the function

$$f(x) = \frac{\cos x - 1 + \frac{1}{2}x^2}{x^4}.$$

What is the limit of $f$ as $x \to 0$? Furthermore, at what rate does $f$ converge to
this limit?

We can actually answer both of these questions simultaneously by using
Taylor’s Theorem. From Taylor’s Theorem (which will be reviewed at the end of this
section), we know that

$$\cos x = 1 - \frac{1}{2}x^2 + \frac{1}{24}x^4 - \frac{1}{720}x^6\cos\xi,$$

for some $\xi$ between 0 and $x$. Hence,

$$\frac{\cos x - 1 + \frac{1}{2}x^2}{x^4} = \frac{1}{24} - \frac{1}{720}x^2\cos\xi.$$

Finally, because

$$\left|\frac{\cos x - 1 + \frac{1}{2}x^2}{x^4} - \frac{1}{24}\right| = \left|\frac{1}{720}x^2\cos\xi\right| \le \frac{1}{720}x^2,$$

it follows that $\lim_{x\to 0} f(x) = \frac{1}{24}$ and the rate of convergence is $O(x^2)$.
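
The limit and the rate found in this example can be checked numerically. The sketch below is illustrative only: the sample values of x are an arbitrary choice, kept moderately large because the numerator suffers from cancellation in floating point arithmetic when x is very small.

```python
import math

def f(x):
    """The function (cos x - 1 + x^2/2) / x^4 from the example."""
    return (math.cos(x) - 1.0 + 0.5 * x * x) / x**4

limit = 1.0 / 24.0
for x in [1.0, 0.5, 0.25, 0.125]:
    error = abs(f(x) - limit)
    # If the rate is O(x^2) with constant 1/720, error / x^2 should stay below 1/720
    print(f"x = {x:<6}  f(x) = {f(x):.10f}  error = {error:.3e}  error/x^2 = {error / x**2:.6f}")

print(f"1/720 = {1.0 / 720.0:.6f}")
```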
Order of Convergence


Order of convergence provides a different measure of convergence speed than rate
of convergence. Whereas rate of convergence examines individually the terms in
the sequence of error values, $e_n = p_n - p$, order of convergence examines the
relationship between successive error values, measuring the effectiveness with which
each iteration reduces the approximation error.


Definition. Let $\{p_n\}$ be a sequence that converges to a number $p$. Let
$e_n = p_n - p$ for $n \ge 0$. If there exist positive constants $\lambda$ and $\alpha$ such that

$$\lim_{n\to\infty} \frac{|p_{n+1} - p|}{|p_n - p|^{\alpha}} = \lim_{n\to\infty} \frac{|e_{n+1}|}{|e_n|^{\alpha}} = \lambda,$$

then $\{p_n\}$ is said to converge to $p$ of ORDER $\alpha$ with ASYMPTOTIC ERROR
CONSTANT $\lambda$.


It follows that for a sequence that converges of order $\alpha$, the error satisfies the
asymptotic relation $|e_{n+1}| \approx \lambda |e_n|^{\alpha}$.

An iterative method is said to be of order $\alpha$ if the sequence it generates
converges of order $\alpha$. The most common values of $\alpha$ in practice are: $\alpha = 1$ (also
known as linear convergence), $\alpha = 2$ (quadratic convergence) and $\alpha = 3$ (cubic
convergence). Noninteger values for $\alpha$ are possible. Note that when $\alpha = 1$, the
sequence of error values satisfies

$$|e_{n+1}| \approx \lambda|e_n| \approx \lambda^2|e_{n-1}| \approx \lambda^3|e_{n-2}| \approx \cdots \approx \lambda^{n+1}|e_0|.$$

Hence, a linearly convergent sequence converges with rate of convergence $O(\lambda^n)$.

To demonstrate the difference between the various orders of convergence,
suppose there are three methods, one linear, one quadratic, and one cubic, all being
applied to the same problem. Each method has an asymptotic error constant of
$\lambda = 0.5$, and there is unit error in the initial approximation (i.e., $|e_0| = 1$). The
chart below displays the error associated with each method through several
iterations.


         LINEAR                    QUADRATIC                  CUBIC
         |e_{n+1}| ≈ 0.5|e_n|      |e_{n+1}| ≈ 0.5|e_n|^2     |e_{n+1}| ≈ 0.5|e_n|^3
|e_1|    0.5                       0.5                        0.5
|e_2|    0.25                      0.125                      0.0625
|e_3|    0.125                     7.8125 x 10^-3             1.2207 x 10^-4
|e_4|    0.0625                    3.0518 x 10^-5             9.0949 x 10^-13
|e_5|    0.03125                   4.6566 x 10^-10            3.7616 x 10^-37
|e_6|    0.015625                  1.0842 x 10^-19
|e_7|    7.8125 x 10^-3            5.8775 x 10^-39

Note the dramatic difference between the linear and quadratic methods. The 
linear method would take more than 100 iterations to achieve the accuracy attained 
by the quadratic method in just seven iterations. Even the more modest accuracy
achieved by the quadratic method in five iterations would take the linear method
31 iterations. Unless each iteration of the quadratic method requires significantly 
more work than each iteration of the linear method, the linear method will never 
compete with the quadratic. On the other hand, there is only a slight difference 
(two or three iterations) between the quadratic and cubic methods. In practice, the 
extra work needed to achieve cubic convergence would therefore not be justified. 
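
The chart and the iteration counts quoted above follow directly from the model relation between successive errors. The sketch below simply iterates that relation; the function names `error_history` and `linear_steps_to_reach` are hypothetical helpers introduced for this illustration.

```python
def error_history(alpha, lam=0.5, e0=1.0, steps=7):
    """Iterate |e_{n+1}| = lam * |e_n|**alpha and return [|e_1|, ..., |e_steps|]."""
    errors = []
    e = e0
    for _ in range(steps):
        e = lam * e**alpha
        errors.append(e)
    return errors

def linear_steps_to_reach(target, lam=0.5, e0=1.0):
    """Count iterations of the linear model needed to bring the error down to target."""
    e, steps = e0, 0
    while e > target:
        e *= lam
        steps += 1
    return steps

linear = error_history(1)
quadratic = error_history(2)
cubic = error_history(3, steps=5)   # five steps, as in the chart

for n in range(7):
    cubic_text = f"{cubic[n]:.4e}" if n < len(cubic) else ""
    print(f"|e_{n + 1}|  {linear[n]:.4e}  {quadratic[n]:.4e}  {cubic_text}")

print(linear_steps_to_reach(quadratic[4]))  # linear steps to match 5 quadratic steps
print(linear_steps_to_reach(quadratic[6]))  # linear steps to match 7 quadratic steps
```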


EXAMPLE 1.6 Determining Order of Convergence 


In Section 1.1, we used the recursive scheme

$$x_{n+1} = \frac{1}{2}\left(x_n + \frac{a}{x_n}\right) \tag{1}$$

to compute an approximation to the square root of a positive real number $a$. Here,
we would like to determine the order of convergence of the generated sequence. To
accomplish this, we must be able to compare the error in the $(n+1)$st term in the
sequence, $x_{n+1} - \sqrt{a}$, with the error in the $n$th term, $x_n - \sqrt{a}$.

We start by subtracting $\sqrt{a}$ from both sides of (1) and performing some basic
algebra. This yields

$$x_{n+1} - \sqrt{a} = \frac{1}{2}\left(x_n + \frac{a}{x_n}\right) - \sqrt{a} = \frac{x_n^2 - 2x_n\sqrt{a} + a}{2x_n} = \frac{(x_n - \sqrt{a})^2}{2x_n}.$$

Accordingly,

$$\lim_{n\to\infty} \frac{|x_{n+1} - \sqrt{a}|}{|x_n - \sqrt{a}|^2} = \lim_{n\to\infty} \frac{1}{2x_n} = \frac{1}{2\sqrt{a}}.$$

Hence, the sequence generated by this scheme has order of convergence equal to 2
and asymptotic error constant $1/(2\sqrt{a})$.


A common task throughout the remainder of the text will be confirming a
theoretical order of convergence using numerical data. For example, we have just
established, theoretically, that the sequence generated by equation (1) should
converge of order 2. Does the sequence actually achieve quadratic convergence in
practice? To answer this question we need to select an $a$, generate the resulting
sequence and examine the ratio $|e_n|/|e_{n-1}|^2$. If this ratio approaches a constant as
$n$ increases (the ratio should, in particular, approach the asymptotic error constant
$1/(2\sqrt{a})$), then we have numerical evidence of quadratic convergence.


EXAMPLE 1.7 Numerical Verification of Quadratic Convergence


With $a = 9$ and $x_0 = 9$, the first five terms of the sequence generated by equation (1)
are listed in the second column of the following table. The absolute error in each
term in the sequence (remember that the sequence is supposed to converge to
$\sqrt{a} = \sqrt{9} = 3$) is given in the third column of the table. For our present purposes,
the most important information in the table is found in the fourth column, which
shows the ratio $|e_n|/|e_{n-1}|^2$.


n    x_n                    |e_n| = |x_n - 3|    |e_n|/|e_{n-1}|^2
0    9                      6
1    5                      2                    0.055556
2    3.4                    0.4                  0.100000
3    3.02352941176471       2.35294 x 10^-2      0.147059
4    3.00009155413138       9.15541 x 10^-5      0.165370
5    3.00000000139698       1.39698 x 10^-9      0.166661


Note that the ratio $|e_n|/|e_{n-1}|^2$ approaches a constant, thereby providing
numerical confirmation of the quadratic convergence of the sequence. Further,
the error ratio appears to be approaching $1/6 = 1/(2\sqrt{9})$, providing numerical
confirmation that the asymptotic error constant for equation (1) is $\lambda = 1/(2\sqrt{a})$.
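
The same verification is easy to automate. The following Python sketch regenerates the table; the helper name `sqrt_sequence` is an assumption made for illustration, and the exact limit is supplied by `math.sqrt`.

```python
import math

def sqrt_sequence(a, x0, steps):
    """Return [x_0, x_1, ..., x_steps] generated by x_{n+1} = (x_n + a/x_n)/2."""
    xs = [x0]
    for _ in range(steps):
        xs.append((xs[-1] + a / xs[-1]) / 2)
    return xs

a, x0 = 9.0, 9.0
xs = sqrt_sequence(a, x0, 5)
errors = [abs(x - math.sqrt(a)) for x in xs]

print(f"0  {xs[0]:.14f}  {errors[0]:.5e}")
for n in range(1, len(xs)):
    ratio = errors[n] / errors[n - 1] ** 2   # should approach 1/(2*sqrt(a))
    print(f"{n}  {xs[n]:.14f}  {errors[n]:.5e}  {ratio:.6f}")

print(f"1/(2*sqrt(a)) = {1 / (2 * math.sqrt(a)):.6f}")
```

Going much beyond five iterations is not informative here, since the error soon reaches the level of floating point roundoff and the ratio stops being meaningful.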


Review of Taylor’s Theorem 


Taylor’s Theorem is an important tool in many branches of mathematics, including 
numerical analysis. The theorem indicates how to construct a polynomial approxi- 
mation for a sufficiently differentiable function. 


Theorem. Suppose $f$ is continuous on $[a,b]$, has $n$ continuous derivatives on
$(a,b)$ and $f^{(n+1)}$ exists on $[a,b]$. Let $x_0 \in [a,b]$. For every $x \in [a,b]$ there
exists a number $\xi(x)$ between $x_0$ and $x$ such that

$$f(x) = P_n(x) + R_n(x),$$

where

$$P_n(x) = \sum_{k=0}^{n} \frac{f^{(k)}(x_0)}{k!}(x - x_0)^k \quad\text{and}\quad R_n(x) = \frac{f^{(n+1)}(\xi(x))}{(n+1)!}(x - x_0)^{n+1}.$$


Here, $P_n$ is called the $n$th-degree Taylor polynomial for $f$ about $x = x_0$. In
practice, $P_n(x)$ is used as an approximation to $f(x)$ for values of $x$ near $x = x_0$.
The term $R_n(x)$ is called the remainder term associated with $P_n$. For each $x$, the
remainder term gives the error incurred by using $P_n(x)$ to approximate $f(x)$. In
practice, the exact value of $\xi$ is rarely known, so the remainder term is used to
determine an error bound rather than the actual approximation error.


EXAMPLE 1.8 A Taylor Polynomial and Its Remainder Term


We will now use Taylor’s theorem to obtain the second-degree Taylor polynomial
and its associated remainder term for the function $f(x) = \sqrt{x}$ about $x_0 = 16$. To
determine the coefficients in the second-degree Taylor polynomial, we need $f$ and
its first two derivatives. For the remainder term, we will need the third derivative
of $f$. Starting from $f(x) = \sqrt{x}$, we compute

$$f'(x) = \frac{1}{2}x^{-1/2}, \quad f''(x) = -\frac{1}{4}x^{-3/2}, \quad\text{and}\quad f'''(x) = \frac{3}{8}x^{-5/2}.$$

Therefore,

$$f(x_0) = \sqrt{16} = 4,$$
$$f'(x_0) = \frac{1}{2\sqrt{16}} = \frac{1}{8},$$
$$f''(x_0) = -\frac{1}{4 \cdot 16^{3/2}} = -\frac{1}{256},$$
$$f'''(\xi) = \frac{3}{8}\xi^{-5/2}.$$

Finally,

$$f(x) = P_2(x) + R_2(x) = 4 + \frac{1}{8}(x - 16) - \frac{1}{512}(x - 16)^2 + \frac{1}{16\xi^{5/2}}(x - 16)^3.$$

Suppose we now take $x = 17$. Using the Taylor polynomial and remainder
term we just calculated, we find

$$\sqrt{17} = f(17) \approx P_2(17) = 4 + \frac{1}{8} - \frac{1}{512} = 4.123046875$$

with an absolute error given by

$$|R_2(17)| = \frac{1}{16\xi^{5/2}},$$

where $16 < \xi < 17$. Because $\xi$ must be larger than 16, it follows that

$$|R_2(17)| < \frac{1}{16 \cdot 16^{5/2}} = \frac{1}{16384} \approx 6.10 \times 10^{-5}.$$

This last inequality provides what is called a theoretical error bound. The value
of $P_2(17)$ can differ from $\sqrt{17}$ by no more than this amount. In fact, the actual
difference between $P_2(17)$ and $\sqrt{17}$ is roughly $5.88 \times 10^{-5}$.
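
A few lines of Python make the comparison between the theoretical bound and the actual error concrete. This is a sketch under stated assumptions: the function name `p2_sqrt_about_16` is hypothetical, and `math.sqrt` stands in for the exact value of the square root.

```python
import math

def p2_sqrt_about_16(x):
    """Second-degree Taylor polynomial of sqrt(x) about x0 = 16."""
    d = x - 16.0
    return 4.0 + d / 8.0 - d * d / 512.0

approx = p2_sqrt_about_16(17.0)                 # 4 + 1/8 - 1/512
actual_error = abs(math.sqrt(17.0) - approx)
# Bound on |R_2(17)|: 1 / (16 * 16^(5/2)), using 16^(5/2) = 16^2 * sqrt(16)
bound = 1.0 / (16.0 * 16.0**2 * math.sqrt(16.0))

print(f"P_2(17)      = {approx:.9f}")
print(f"actual error = {actual_error:.3e}")
print(f"error bound  = {bound:.3e}")
```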
We conclude this section by stating the $n$th-degree Taylor polynomial and
associated remainder term for several common functions. The derivations of these
expressions are left for the exercises.

$$e^x = 1 + x + \frac{x^2}{2} + \cdots + \frac{x^n}{n!} + \frac{x^{n+1}}{(n+1)!}e^{\xi}$$

$$\sin x = x - \frac{x^3}{3!} + \frac{x^5}{5!} - \cdots + \frac{(-1)^n x^{2n+1}}{(2n+1)!} + \frac{(-1)^{n+1} x^{2n+3}}{(2n+3)!}\cos\xi$$

$$\cos x = 1 - \frac{x^2}{2!} + \frac{x^4}{4!} - \cdots + \frac{(-1)^n x^{2n}}{(2n)!} + \frac{(-1)^{n+1} x^{2n+2}}{(2n+2)!}\cos\xi$$

$$\frac{1}{1+x} = 1 - x + x^2 - \cdots + (-1)^n x^n + \frac{(-1)^{n+1} x^{n+1}}{(1+\xi)^{n+2}}$$

EXERCISES 
1. Compute each of the following limits and determine the corresponding rate of 
convergence. 
(a) iIMnsoo ae 
(b) $\lim_{n\to\infty} \left(\sqrt{n+1} - \sqrt{n}\right)$
(c) limps 182 
. n2— 
(d) limnoo at pata 
2. Compute each of the following limits and determine the corresponding rate of 
convergence. 
(a) limz—o © = 


(b) limz—o sin z 


(c) 


limz—o 


rd 
é* ~cosa@—% 
pa EE at 
=z 


_ 2/5 24 
(d) limz—+o cosz Lite (2—2* /24 


3. Numerically determine which of the following sequences approaches 1 faster, and
   then confirm the numerical evidence by determining the rate of convergence of
   each sequence.

   $$\lim_{x\to 0} \frac{\sin x^2}{x^2} \quad\text{versus}\quad \lim_{x\to 0} \frac{(\sin x)^2}{x^2}$$


4. Suppose that $0 < a < b$.

   (a) Show that if $\alpha_n = \alpha + O(1/n^b)$, then $\alpha_n = \alpha + O(1/n^a)$.
   (b) Show that if $f(x) = L + O(x^b)$, then $f(x) = L + O(x^a)$.

5. Suppose that $f_1(x) = L_1 + O(x^a)$ and $f_2(x) = L_2 + O(x^b)$. Show that

   $$c_1 f_1(x) + c_2 f_2(x) = c_1 L_1 + c_2 L_2 + O(x^c),$$

   where $c = \min(a, b)$.


6. The table below lists the errors of successive iterates for three different methods


for approximating “7/5. Estimate the order of convergence of each method, and 
explain how you arrived at your conclusions. 


Method 1 Method 2 Method 3 
4.0 x107? 3.7 x1074 4.3 x1073 
9.1 x1074 1.2 x10715 1.8 x1078 
4.8 x10-? 1.5 x19~ 60 1.4 «10774 


7. Let $\{p_n\}$ be a sequence that converges to the limit $p$.


(a) If 


what can be said about the order of convergence of {pn} to p? 
(b) If 


what can be said about the order of convergence of {pn} to p? 


8. Suppose theory indicates that the sequence $\{p_n\}$ converges to $p$ of order 1.5.
   Explain how you would numerically verify this order of convergence.

9. Theory indicates that the following sequence should converge to $\sqrt{3}$ of order
   1.618. Does the sequence actually achieve an order of convergence of 1.618? If
   not, what is the actual order?

   p_n
   2.000000000000000
   1.666666666666667
   1.727272727272727
   1.732142857142857
   1.732050680431722
   1.732050807565499


10. Theory indicates that the following sequence should converge to 4/3 of order
    1.618. Does the sequence actually achieve an order of convergence of 1.618? If
    not, what is the actual order?

    n    p_n
    0    1.498664098580016
    1    1.497353997792205
    2    1.428801977335339
    3    1.401092915389552
    4    1.376493676051456
    5    1.361345745573130
    6    1.351034482500881
    7    1.344479850695066


11. Show that the convergence of the sequence generated by the formula

    $$x_{n+1} = \frac{x_n^3 + 3x_n a}{3x_n^2 + a}$$

    toward $\sqrt{a}$ is third order. What is the asymptotic error constant?

12. Let $a$ be a nonzero real number. For any $x_0$ satisfying $0 < x_0 < 2/a$, the
    recursive sequence defined by

    $$x_{n+1} = x_n(2 - a x_n)$$

    converges to $1/a$. What are the order of convergence and the asymptotic error
    constant?

13. Suppose that the sequence $\{p_n\}$ converges linearly to the limit $p$ with asymptotic
    error constant $\lambda$. Further suppose that $p_{n+1} - p$, $p_n - p$ and $p_{n-1} - p$ are all of
    the same sign. Show that

    $$\frac{p_{n+1} - p_n}{p_n - p_{n-1}} \approx \lambda.$$

14. A sequence $\{p_n\}$ converges superlinearly to $p$ provided

    $$\lim_{n\to\infty} \frac{|p_{n+1} - p|}{|p_n - p|} = 0.$$

    Show that if $p_n \to p$ of order $\alpha$ for $\alpha > 1$, then $\{p_n\}$ converges superlinearly to
    $p$.

15. Suppose that $\{p_n\}$ converges superlinearly to $p$ (see Exercise 14). Show that

    $$\lim_{n\to\infty} \frac{|p_{n+1} - p_n|}{|p_n - p|} = 1.$$



16. (a) Determine the third-degree Taylor polynomial and associated remainder
        term for the function $f(x) = \ln(1 - x)$. Use $x_0 = 0$.

    (b) Using the results of part (a), approximate $\ln(0.25)$ and compute the theoretical
        error bound associated with this approximation. Compare the theoretical
        error bound with the actual error.

    (c) Compute the following limit and determine the corresponding rate of
        convergence:

        $$\lim_{x\to 0} \frac{\ln(1 - x) + x + \frac{1}{2}x^2}{x^3}.$$

17. (a) Determine the third-degree Taylor polynomial and associated remainder
        term for the function $f(x) = \sqrt{1 + x}$. Use $x_0 = 0$.

    (b) Using the results of part (a), approximate $\sqrt{1.5}$ and compute the theoretical
        error bound associated with this approximation. Compare the theoretical
        error bound with the actual error.

    (c) Compute the following limit and determine the corresponding rate of
        convergence:

        $$\lim_{x\to 0} \frac{\sqrt{1 + x} - 1 - \frac{1}{2}x}{x^2}.$$


In Exercises 18-21, verify that Taylor's theorem produces the indicated formula, where
$\xi$ is between 0 and $x$.


18.

19. $$\sin x = x - \frac{x^3}{3!} + \frac{x^5}{5!} - \cdots + \frac{(-1)^n x^{2n+1}}{(2n+1)!} + \frac{(-1)^{n+1} x^{2n+3}}{(2n+3)!} \cos \xi$$

20. $$\cos x = 1 - \frac{x^2}{2!} + \frac{x^4}{4!} - \cdots + \frac{(-1)^n x^{2n}}{(2n)!} + \frac{(-1)^{n+1} x^{2n+2}}{(2n+2)!} \cos \xi$$

21. $$\frac{1}{1 + x} = 1 - x + x^2 - \cdots + (-1)^n x^n + \frac{(-1)^{n+1} x^{n+1}}{(1 + \xi)^{n+2}}$$


1.3 MATHEMATICS ON THE COMPUTER: FLOATING POINT NUMBER SYSTEMS


Any meaningful discourse on numerical methods/numerical analysis must include 
a discussion of errors. After all, numerical methods are generally designed to de- 
termine approximate solutions. Sources of error can be broadly grouped into four 
categories: 


• modeling error;
• discretization and truncation error;
• roundoff and data error;
• human error.


The assumptions that are made during the model building phase (as described in the
overview to this chapter) give rise to equations which are at best approximations to 
the system being studied. A quantification and classification of these approximation 
errors can be found in most textbooks on mathematical modeling but is beyond 
the scope of this text. 

Many of the techniques that are developed in the forthcoming chapters involve
the conversion of a continuous problem into a discrete one. This conversion
process introduces what are referred to as discretization errors. Still other
techniques involve the truncation of an infinite series, giving rise to truncation errors
in the approximate solution. As methods are developed, these types of error will
be examined in detail. Although no one likes to admit it, we all make programming
errors and computation errors. Great care must be taken to ensure that all
human errors are detected and corrected. Checking programs with test problems
whose exact solution is known, to verify that theoretical error bounds and rates of
convergence are satisfied, is a powerful technique for achieving this goal.

Unlike discretization and truncation errors, which arise due to the formulation
of a numerical method, roundoff and data errors are inherent in the way that
computers and data acquisition hardware represent real numbers. The objectives of
this section are to examine the representation of real numbers on computers and to
make a precise definition of roundoff error. The important concept of conditioning
is also introduced. Discussions of floating point arithmetic, the accumulation of
roundoff error through a sequence of calculations, and the types of operations that
should be avoided in the construction of a numerical algorithm will be deferred to
the next section.


An Example to Set the Stage 


As part of a laboratory experiment, a group of students needs to calculate the
modulus of elasticity, $E$, of a steel beam. An object of mass $m = 0.491$ kg is
suspended from one end of a beam whose length is $l = 0.451$ m, whose width is
$a = 0.021$ m and whose thickness is $b = 0.003$ m. The resulting deflection of the
tip of the beam is measured to be $d = 0.142$ m. Substituting these values into the
formula

$$E = \frac{4 m g l^3}{d a b^3},$$

where $g = 9.81$ m/s$^2$ is the acceleration due to gravity, the students calculate

$$E = \frac{4(0.491)(9.81)(0.451)^3}{(0.142)(0.021)(0.003)^3} \approx 21.95 \times 10^9 \text{ N/m}^2.$$

A standard table of the properties of steel, however, indicates that the actual value
should be $E = 30 \times 10^9$ N/m$^2$. Is the value calculated by the students within
acceptable limits of the tabulated value?

To answer this question, we must recognize that all physical measurements are 
made with finite precision and hence include some amount of error. For instance, if 
it is assumed that all of the measured values given above have been rounded to the 
digits shown, then the mass of the object that has been suspended from the beam 
can actually be anywhere between 0.4905 kg and 0.4915 kg. Similarly, the length 
of the beam is between 0.4505 m and 0.4515 m, the width of the beam is between 
0.0205 m and 0.0215 m, the thickness of the beam is between 0.0025 m and 0.0035 
m, and the deflection of the tip of the beam is between 0.1415 m and 0.1425 m. 
Since each measured value is really an interval, the equation for the modulus of 
elasticity should be used to determine a range of possible values: 


$$\frac{4(0.4905)(9.81)(0.4505)^3}{(0.1425)(0.0215)(0.0035)^3} \le E \le \frac{4(0.4915)(9.81)(0.4515)^3}{(0.1415)(0.0205)(0.0025)^3},$$

or

$$13.397 \times 10^9 \text{ N/m}^2 \le E \le 39.165 \times 10^9 \text{ N/m}^2.$$

Note that this range includes the tabulated value of $E = 30 \times 10^9$ N/m$^2$. Therefore,
taking into account the accuracy of the measurements, the value calculated by the
students is within acceptable limits of the tabulated value.
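The interval calculation above is easy to reproduce. The following is a minimal sketch, assuming (as the text does) that every measurement has been rounded to three decimal places and that $g$ is exact; the helper names `modulus` and `rounding_interval` are ours, introduced only for illustration.

```python
G = 9.81  # acceleration due to gravity in m/s^2, treated as exact


def modulus(m, l, d, a, b):
    """Modulus of elasticity E = 4 m g l^3 / (d a b^3), in N/m^2."""
    return 4 * m * G * l**3 / (d * a * b**3)


def rounding_interval(value, half_width=0.0005):
    """Range of true values that round to `value` at three decimal places."""
    return value - half_width, value + half_width


m, l, d, a, b = 0.491, 0.451, 0.142, 0.021, 0.003
m_lo, m_hi = rounding_interval(m)
l_lo, l_hi = rounding_interval(l)
d_lo, d_hi = rounding_interval(d)
a_lo, a_hi = rounding_interval(a)
b_lo, b_hi = rounding_interval(b)

nominal = modulus(m, l, d, a, b)
# E increases with m and l and decreases with d, a and b, so the extremes
# are attained at opposite ends of the measurement intervals.
e_min = modulus(m_lo, l_lo, d_hi, a_hi, b_hi)
e_max = modulus(m_hi, l_hi, d_lo, a_lo, b_lo)

print(f"nominal E = {nominal / 1e9:.3f} x 10^9 N/m^2")
print(f"{e_min / 1e9:.3f} x 10^9 <= E <= {e_max / 1e9:.3f} x 10^9 N/m^2")
```

Most of the width of the interval comes from the thickness $b$: it is known to only one significant digit and enters the formula cubed.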


Floating Point Number Systems 


Although practical problems deal with real valued quantities and the theorems upon 
which numerical methods are based are written in terms of real valued functions, 
computers, like measurement devices, have no concept of the real numbers. Instead, 
computers represent real numbers in what is known as a floating point number 
system.

Definition. A FLOATING POINT NUMBER SYSTEM, $F(\beta, k, m, M)$, is a subset
of the real number system characterized by the parameters

$\beta$: the base
$k$: the number of digits in the base $\beta$ expansion
$m$: the minimum exponent
$M$: the maximum exponent

Elements of $F(\beta, k, m, M)$ are those real numbers that can be expressed exactly
as

$$\pm (0.d_1 d_2 d_3 \cdots d_k)_\beta \times \beta^e,$$

where $m \le e \le M$. The first base $\beta$ digit, $d_1$, must be nonzero, except when
the number being represented is 0, in which case $d_1 = 0$.


The restriction is made on the first digit, $d_1$, to guarantee that each element
in the set has a unique representation. Computers primarily use $\beta = 2$, or a binary
number system, though some computers use $\beta = 16$, or a hexadecimal number
system. Handheld calculators typically use a decimal number system; i.e., $\beta = 10$.
Commonly used values for $k$, $m$ and $M$ will be noted later in the section and in the
exercises.

There are three important ways in which a floating point number system 
differs from the real number system. First, a floating point number system is 
a discrete set. In contrast, the real number system is continuous, meaning that 
between any two real numbers there are infinitely many other real numbers. Second, 
a floating point number system is a finite set, whereas the real number system is 
an infinite set. Here, finite refers not only to the number of elements in the set, 
but also to the range of values. A floating point number system contains both a 
smallest and a largest positive element, as well as a smallest and a largest negative 
element. The real number system has no such elements. Third, real numbers are 
uniformly distributed while finite precision imposes a nonuniform distribution upon 
the elements of a floating point number system. In particular, the elements near
zero are more closely spaced than the elements at the extremes of the representable 
values. 

As an example, consider the system

F(10, 2, 0, 2) = {0,
    ±0.10, ±0.11, ±0.12, ..., ±0.19,
    ...
    ±0.90, ±0.91, ±0.92, ..., ±0.99,
    ±1.0, ±1.1, ±1.2, ..., ±1.9,
    ...
    ±9.0, ±9.1, ±9.2, ..., ±9.9,
    ±10, ±11, ±12, ..., ±19,
    ...
    ±90, ±91, ±92, ..., ±99}.

Figure 1.1 The positive elements from the floating point number system
F(10, 2, 0, 2).


Figure 1.1, which displays the positive elements from F(10, 2, 0, 2), clearly illustrates 
the discrete nature of the number system. Whether we examine the list of elements 
given above or the figure, we see that the smallest nonzero elements in F(10, 2,0, 2) 
are +0.10, while the elements of largest magnitude are +99. Taking into account 
both positive and negative values, the system contains only 541 elements. Finally,
Figure 1.1 also makes it clear that the elements in this system are not uniformly 
distributed along the number line. In particular, there is a gap of one-tenth between 
0 and the smallest nonzero numbers, elements are separated by a distance of one- 
hundredth in the range from 0.1 to 1.0, by a distance of one-tenth in the range 
from 1.0 to 10.0 and by a distance of 1 from 10 through 99. 
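
These observations can be checked by brute force. The sketch below enumerates a small floating point system directly from the definition; the function name `floating_point_system` is our own, and exact rational arithmetic (`fractions.Fraction`) is used so that the enumeration itself introduces no roundoff.

```python
from fractions import Fraction


def floating_point_system(beta, k, m, M):
    """Return every element of F(beta, k, m, M), sorted, as exact Fractions."""
    elements = {Fraction(0)}
    for e in range(m, M + 1):
        # Mantissas (0.d1 d2 ... dk) with d1 != 0 correspond to the integers
        # beta^(k-1), ..., beta^k - 1 divided by beta^k.
        for digits in range(beta ** (k - 1), beta ** k):
            value = Fraction(digits, beta ** k) * Fraction(beta) ** e
            elements.update((value, -value))
    return sorted(elements)


F = floating_point_system(10, 2, 0, 2)
positive = [x for x in F if x > 0]
gaps = sorted({b - a for a, b in zip(positive, positive[1:])})

print(len(F))                                   # 541
print(float(positive[0]), float(positive[-1]))  # 0.1 99.0
print([float(g) for g in gaps])                 # [0.01, 0.1, 1.0]
```

Only three distinct spacings occur between neighboring positive elements, one for each admissible exponent.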

The fact that a floating point number system has smallest nonzero elements,
as well as largest elements, carries certain practical implications. Suppose, for
instance, that a calculation produces a number that falls between zero and one of
the smallest nonzero elements of the system [e.g., the operation 0.20/87 performed
in F(10, 2, 0, 2)]. Because the operation has resulted in a number that is too small
to be represented in the system, we say that an underflow has occurred. Typically,
underflow is handled by replacing the number with zero. On the other hand, a
calculation that produces a number that is too large to be represented in the system
generates an overflow exception. An example of an overflow exception would be the
calculation 5.7 x 43 performed in F(10, 2, 0, 2). Overflow usually causes a process
to halt execution.


Roundoff Error


Suppose a hypothetical computer uses F(10, 2, 0, 2) as its floating point system.
Granted, this is a very crude computer, but let's stick with it for the moment. A
given calculation requires the value $\sqrt{7.1} \approx 2.66458$. Since this number requires
more than two digits, it is not a member of the set F(10, 2, 0, 2). An element
from the set must therefore be selected to represent $\sqrt{7.1}$. The two most natural
approaches to take would be to either drop all digits after the second one, producing
the approximation $\sqrt{7.1} \approx 2.6$, or to round the number to two digits of accuracy,
producing the approximation $\sqrt{7.1} \approx 2.7$. The former approach is typically referred
to as chopping the number, while the latter is known as rounding the number.
In the general case, let $y$ be a real number whose expansion is given by

$$y = \pm (0.d_1 d_2 d_3 \cdots d_k d_{k+1} \cdots)_\beta \times \beta^e$$

with $d_1 \ne 0$ and $m \le e \le M$. Denote by $fl(y) \in F(\beta, k, m, M)$ the floating point
equivalent of $y$ (i.e., that element from the floating point system that will be used
to represent $y$). When the number is chopped, the floating point equivalent is given
by

$$fl_{chop}(y) = \pm (0.d_1 d_2 d_3 \cdots d_k)_\beta \times \beta^e;$$

when the number is rounded,

$$fl_{round}(y) = \begin{cases} \pm (0.d_1 d_2 d_3 \cdots d_k)_\beta \times \beta^e, & d_{k+1} < \beta/2 \\ \pm \left[ (0.d_1 d_2 d_3 \cdots d_k)_\beta + \beta^{-k} \right] \times \beta^e, & d_{k+1} \ge \beta/2. \end{cases}$$


Regardless of whether the number is chopped or rounded, the conversion of 
a number into its floating point equivalent introduces some amount of error. This 
type of error is known as roundoff error. 


Definition. The error introduced by converting a real number to its floating 
point equivalent is called ROUNDOFF ERROR. 


The standard metrics of absolute and relative error, as defined below, are used 
to quantify the effect of roundoff. 


Definition. Let $p^*$ denote any approximation to the value $p$. The ABSOLUTE
ERROR in $p^*$ is given by

$$|p^* - p|.$$

The RELATIVE ERROR in $p^*$ is

$$|p^* - p| / |p|,$$

provided that $p \ne 0$, and is usually, though not always, expressed as a percentage.

For the example involving $\sqrt{7.1}$, the absolute and relative errors associated
with chopping the number are

$$|2.6 - 2.66458| = 6.458 \times 10^{-2}$$

and

$$\frac{|2.6 - 2.66458|}{2.66458} = 0.0242 = 2.42\%,$$

respectively. With rounding, the corresponding errors are $3.542 \times 10^{-2}$ and 1.33%.
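
For readers who wish to experiment, here is a minimal sketch of chopping and rounding to $k$ decimal digits (so $\beta = 10$ only), built on Python's standard `decimal` module. The function name `fl` and its `mode` argument are our own choices, and the exponent limits $m$ and $M$ are ignored, so neither underflow nor overflow is detected.

```python
from decimal import Decimal, ROUND_DOWN, ROUND_HALF_UP


def fl(y, k, mode="round"):
    """Floating point equivalent of y with k decimal digits (base 10 only).

    mode="chop" drops the digits after the k-th one; mode="round" rounds
    to k digits, with ties going away from zero.
    """
    y = Decimal(y)
    if y == 0:
        return y
    e = y.adjusted() + 1                # y = (0.d1 d2 ...) x 10^e
    quantum = Decimal(1).scaleb(e - k)  # weight of the k-th digit, 10^(e-k)
    rounding = ROUND_DOWN if mode == "chop" else ROUND_HALF_UP
    return y.quantize(quantum, rounding=rounding)


y = Decimal("7.1").sqrt()               # 2.66458... to 28 digits
for mode in ("chop", "round"):
    approx = fl(y, 2, mode)
    abs_err = abs(approx - y)
    rel_err = abs_err / abs(y)
    print(mode, approx, f"{abs_err:.3e}", f"{rel_err:.2%}")
```

With $k = 2$ the two modes return 2.6 and 2.7, and the printed errors can be compared with the values quoted above.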
In the general case, consider chopping first. A bound on the absolute size of
the roundoff error is

$$|fl_{chop}(y) - y| = (0.d_{k+1} d_{k+2} d_{k+3} \cdots)_\beta \times \beta^{e-k} \le (1.0)_\beta \times \beta^{e-k} = \beta^{e-k}.$$

To bound the relative error, a lower bound for $|y|$ is needed. Provided $y \ne 0$, given
the restriction on the value of $d_1$,

$$|y| = (0.d_1 d_2 d_3 \cdots)_\beta \times \beta^e \ge (0.1)_\beta \times \beta^e = \beta^{e-1}.$$

Therefore,

$$\frac{|fl_{chop}(y) - y|}{|y|} \le \frac{\beta^{e-k}}{\beta^{e-1}} = \beta^{1-k}.$$

By proceeding in a similar manner, it can be shown that when a number is rounded,
the bounds on both the absolute and relative error due to roundoff are one-half the
bounds obtained when a number is chopped. That is,

$$|fl_{round}(y) - y| \le \frac{1}{2} \beta^{e-k}$$

and

$$\frac{|fl_{round}(y) - y|}{|y|} \le \frac{1}{2} \beta^{1-k}.$$


Note that the bound on the relative error due to roundoff, regardless of
whether the number has been chopped or rounded, is independent of the number.
The bound depends only on the base, $\beta$, and the number of digits, $k$, of the
floating point system in use. This bound is therefore a function of the hardware
implementation; accordingly, the phrase machine precision, or machine epsilon, is
often used.

Definition. The MACHINE PRECISION, $u$, is given by

$$u = \begin{cases} \beta^{1-k}, & \text{chopping} \\ \frac{1}{2} \beta^{1-k}, & \text{rounding,} \end{cases}$$

where $\beta$ is the base and $k$ the number of digits in the implemented floating
point number system.


Suppose that z and y are two nearly equal numbers. Because floating point 
numbers are represented using only a finite number of digits, it is natural to measure 
the “closeness” of two numbers not only in terms of absolute and relative difference, 
but also in terms of the number of significant digits which they have in common. 


Definition. Suppose that $x \ne 0$ and that

$$\beta^{-(t+1)} < \left| \frac{x - y}{x} \right| \le \beta^{-t}$$

for some positive integer $t$. Then we say that $x$ and $y$ agree to at least $t$ and
at most $t + 1$ SIGNIFICANT base $\beta$ DIGITS.


Clearly, if two numbers agree to at least $t$ significant base $\beta$ digits, these
numbers will be indistinguishable in any floating point number system with base $\beta$
and $k \le t$.


EXAMPLE 1.9 How Close Are Two Numbers


Consider the numbers cos(0.1°) = 0.999998476 and cos(0.11°) = 0.999998157.
Since

$$\left| \frac{\cos(0.1°) - \cos(0.11°)}{\cos(0.1°)} \right| = 3.198 \times 10^{-7}$$

and

$$10^{-7} < 3.198 \times 10^{-7} < 10^{-6},$$

it follows that cos(0.1°) and cos(0.11°) agree to at least 6 and at most 7 decimal
digits. To how many significant binary digits do these two cosine values agree?
Since

$$2^{-22} = 2.384 \times 10^{-7} < 3.198 \times 10^{-7} < 4.768 \times 10^{-7} = 2^{-21},$$

we see that cos(0.1°) and cos(0.11°) agree to at least 21 and at most 22 binary
digits.
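
The bookkeeping in this example amounts to taking a logarithm of the relative difference. A minimal sketch follows, assuming ordinary double precision arithmetic is accurate enough for the comparison; the function name `digits_in_common` is ours.

```python
import math


def digits_in_common(x, y, beta=10):
    """Return t with beta**-(t+1) < |x - y| / |x| <= beta**-t.

    x and y then agree to at least t and at most t + 1 base-beta digits.
    A result of zero or less means the values share no leading digit.
    """
    rel = abs(x - y) / abs(x)
    if rel == 0:
        raise ValueError("x and y are identical; they agree to every digit")
    return math.floor(-math.log(rel, beta))


x = math.cos(math.radians(0.1))
y = math.cos(math.radians(0.11))
print(digits_in_common(x, y, beta=10))  # 6
print(digits_in_common(x, y, beta=2))   # 21
```

Relative differences that fall very close to an exact power of $\beta$ may be misclassified by one digit, because the logarithm is itself computed in floating point.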


The IEEE Standard 


To this point, we've been dealing with floating point number systems in the abstract.
But what systems are we likely to encounter in practice? In the 1970s, work
was begun to develop standards for the representation and arithmetic of binary
($\beta = 2$) floating point numbers on microprocessors. One of the major objectives
was to eliminate inconsistencies when moving code from one machine to another.
The culmination of this effort came in 1985 with the publication of the report Binary
Floating Point Arithmetic Standard 754-1985 by the American IEEE (Institute
of Electrical and Electronics Engineers) Computer Society. This report contained
specifications for the representation of floating point numbers, the elementary
operations and rounding rules available, the rules for converting between number
systems, and the handling of exceptional cases. In 1989, the International
Electrotechnical Commission adopted the IEEE standards as international standards.
Today, these standards are generally adhered to by microprocessor manufacturers.

The IEEE standard actually specifies two different floating point formats: a 
basic format and an extended format. Each format contains both a single preci- 
sion and a double precision number system. The basic format includes the single 
precision floating point number system F(2, 24, —125, 128) and the double precision 
system F(2,53, —1021, 1024). For the extended format, only lower or upper bounds 
on number system parameters are provided. In particular, an extended single precision
system must have $k \ge 32$, $m \le -1021$, and $M \ge 1024$. An extended double
precision system must have $k \ge 64$, $m \le -16381$, and $M \ge 16384$.

Let's examine the IEEE standard single precision number system in more
detail. With $\beta = 2$ and $k = 24$, machine precision with rounding is

$$u = \frac{1}{2} \cdot 2^{1-24} = 2^{-24} \approx 5.96 \times 10^{-8}.$$

Accordingly, there are between seven and eight significant decimal digits available
in single precision. The smallest positive number in single precision is

$$(0.1)_2 \times 2^{-125} = 2^{-126} \approx 1.18 \times 10^{-38},$$

while the largest positive number is

$$(0.11 \cdots 1)_2 \times 2^{128} = (1 - 2^{-24}) \times 2^{128} \approx 3.40 \times 10^{38}.$$

An examination of IEEE standard double precision, as well as several other number 
systems that can be found on specific processors, is left for the exercises. 
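
The same three quantities follow from the parameters of any system $F(\beta, k, m, M)$. The sketch below evaluates them exactly with rational arithmetic, using the normalization adopted in this section (mantissa of the form $0.d_1 d_2 \cdots d_k$); the function name `system_parameters` is our own, and the single precision values are reproduced only as a check.

```python
from fractions import Fraction


def system_parameters(beta, k, m, M, rounding=True):
    """Machine precision, smallest and largest positive element of F(beta, k, m, M)."""
    beta = Fraction(beta)
    u = beta ** (1 - k) / (2 if rounding else 1)
    smallest = beta ** (m - 1)                # (0.100...0)_beta x beta^m
    largest = (1 - beta ** (-k)) * beta ** M  # every digit equal to beta - 1
    return u, smallest, largest


u, smallest, largest = system_parameters(2, 24, -125, 128)
print(f"u        = {float(u):.3e}")
print(f"smallest = {float(smallest):.3e}")
print(f"largest  = {float(largest):.3e}")
```

Passing `rounding=False` gives the machine precision for a system that chops.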


Conditioning 


In the next section, several examples will be used to demonstrate that certain oper- 
ations can lead to a dramatic and devastating accumulation of roundoff error when 
a sequence of finite precision calculations is performed. There are also, however,
mathematical problems for which a small change in input data, such as brought 
about by roundoff, leads to large changes in the analytical solution to the problem. 
Here, input data can refer to, among other things, the coefficients of a polynomial 
whose roots are being computed, the initial and/or boundary values associated with 
a differential equation, and the tabulated values from which a derivative and/or an 
integral needs to be approximated. 
Consider the initial value problem

$$\frac{dx}{dt} - x = e^{-2t}, \qquad x(0) = -1/3,$$

whose exact solution is $x(t) = -e^{-2t}/3$. If the initial condition is changed to
$x(0) = -1/3 + \epsilon$, the exact solution of the problem becomes $x(t) = \epsilon e^t - e^{-2t}/3$.
No matter how small the perturbation, $\epsilon$, to the initial condition, the difference
between the two solutions, $\epsilon e^t$, grows without bound. Figure 1.2 plots both solutions
with $\epsilon = 10^{-7}$.

Figure 1.2 Comparison between the solution of $x' - x = e^{-2t}$ subject
to the initial condition $x(0) = -1/3$ and $x(0) = -1/3 + \epsilon$ with $\epsilon = 10^{-7}$.
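
The growth of the difference between the two solutions is easy to tabulate. In the minimal sketch below, the closed-form solutions are taken from the discussion above; the function name `x_exact` and the sample times are our own, and the perturbation size is an illustrative choice.

```python
import math


def x_exact(t, eps=0.0):
    """Solution of x' - x = exp(-2t) subject to x(0) = -1/3 + eps."""
    return eps * math.exp(t) - math.exp(-2.0 * t) / 3.0


eps = 1e-7  # illustrative perturbation of the initial condition
for t in (0.0, 5.0, 10.0, 15.0, 20.0):
    diff = x_exact(t, eps) - x_exact(t)
    print(f"t = {t:4.1f}   difference = {diff:.3e}")
```

The unperturbed solution decays to zero, while the perturbed one is eventually dominated by the term $\epsilon e^t$; by $t = 20$ a change of $10^{-7}$ in the data has become a change of order 10 in the solution.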

As another example, take the polynomial

$$P(x) = (x - 1)(x - 2)(x - 3)(x - 4)(x - 5)(x - 6)(x - 7)(x - 8)(x - 9)(x - 10).$$

The roots of this tenth-degree polynomial are clearly the consecutive integers from 1
through 10, inclusive. Consider the perturbed polynomial $\tilde{P}(x) = P(x) + x^5$. The
coefficient of $x^5$ in $P(x)$ is $-902055$, so the relative change in this coefficient in
$\tilde{P}(x)$ is roughly one-thousandth of one percent. Figure 1.3 displays the roots of
both $P(x)$ and $\tilde{P}(x)$. The relative change in the first two roots is on the order of
the change in the coefficient of $x^5$, while the changes in the third and fourth roots
are roughly 1% and 27%, respectively. The final six roots have been transformed
into three complex conjugate pairs with definite nonzero imaginary parts.

Problems like these two, for which a small change in input data results in
a large change in the output, are said to be ill conditioned. It is important to
note that this ill conditioning is inherent in the mathematical problem itself and
is not an artifact of any numerical computation scheme. In forthcoming chapters,
the conditioning of the various mathematical problems being investigated will be
discussed where appropriate.


Figure 1.3 Roots of the polynomial $P(x)$, o, versus those of the polynomial
$\tilde{P}(x)$, *, in the complex plane.


EXERCISES


1. Provide the floating point equivalent for each of the following numbers from the
floating point number system F(10, 4, 0, 4). Consider both chopping and rounding.
Compute the absolute and relative error in each floating point equivalent.

(a) $\pi$ (b) $e$
(c) $\sqrt{2}$ (d) 1/7
(e) cos 22° (f) ln 10
(g) $\sqrt{9}$


2. Prove the bounds on the absolute and relative roundoff error associated with
rounding:

$$|fl_{round}(y) - y| \le \frac{1}{2} \beta^{e-k} \quad \text{and} \quad \frac{|fl_{round}(y) - y|}{|y|} \le \frac{1}{2} \beta^{1-k}.$$

3. Show that machine precision is the smallest floating point number, $v$, such that

$$fl(1 + v) > 1.$$

4. (a) Construct an algorithm to determine machine precision and another algorithm
to determine the smallest positive number of a floating point number
system.

(b) Implement the algorithms from part (a) to determine machine precision and
the smallest positive number on your computing system. Consider both
single and double precision.

(c) Assuming that your computing system uses $\beta = 2$ and rounding, use the
results from part (b) to determine the values for $k$ and $m$.

5. Determine machine precision, the smallest positive number and the largest positive
number for the floating point number system used by your calculator. Assuming
the calculator uses $\beta = 10$, determine the values for $k$, $m$, and $M$.

6. Determine the number of significant decimal digits and the number of significant
binary digits to which each of the following pairs of numbers agree.

(a) 355/113 and $\pi$
(b) 685/252 and $e$
(c) 10002 and $\sqrt{10001}$
(d) 103/280 and $1/e$

7. The ideal gas law states that $PV = nRT$, where $P$ is the pressure of the gas,
$V$ is the volume, $n$ is the number of moles, $T$ is the temperature, and $R =
0.08206$ atm · m$^3$/moles · K is the universal gas constant.

(a) Experimentally, it has been determined that $P = 0.750$ atm, $V = 1.15$ m$^3$,
and $T = 204.1$ K. Assuming that all values have been rounded to the digits
shown, in what range of values does $n$ fall?

(b) Experimentally, it has been determined that $V = 0.331$ m$^3$, $n = 0.00712$
moles, and $T = 264.7$ K. Assuming that all values have been rounded to the
digits shown, in what range of values does $P$ fall?

8. In a physics laboratory, students measure the mass of a rectangular block to be
243.27 ± 0.005 grams. The length, width, and depth of the block are measured
to be 7.8 ± 0.05 cm, 3.1 ± 0.05 cm, and 4.2 ± 0.05 cm, respectively.

(a) In what range of values does the volume of the block fall?

(b) In what range of values does the density of the block fall? Density is mass
per unit volume.

9. Students are using a pendulum to experimentally determine the acceleration due
to gravity, $g$. They measure the period, $T$, of the pendulum to be 2.2 seconds,
and the length, $l$, of the pendulum to be 1.15 meters. Assuming that all values
are correct to the digits shown, in what range of values does $g$ fall? The variables
in this problem are related by the formula $T = 2\pi\sqrt{l/g}$.

10. Determine machine precision, the smallest positive number and the largest positive
number in the IEEE standard double precision system. Approximately how
many significant decimal digits does the double precision standard supply?

11. In addition to the standard single and double precision floating point systems,
Intel microprocessors also have an extended precision system F(2, 64, −16381,
16384). Determine machine precision, the smallest positive number and the
largest positive number for this extended precision system.

12. IBM System/390 mainframes provide three floating point number systems: short
precision F(16, 6, −64, 63), long precision F(16, 14, −64, 63), and extended precision
F(16, 28, −64, 63). Compare machine precision, the smallest positive number,
and the largest positive number for each of these number systems.

13. A common floating point number system used on modern calculators is F(10, 10,
−98, 100). Determine machine precision, the smallest positive number and the
largest positive number for this extended precision system.


14. (a) Show that the number of elements in the set $F(\beta, k, m, M)$ is given by
$1 + 2(\beta - 1)\beta^{k-1}(M - m + 1)$.

(b) How many elements are in the IEEE standard single precision number system?

(c) How many elements are in the IEEE standard double precision number
system?

15. Consider the function $f(x) = x^2 - 4x + 4$.

(a) What are the zeros of $f$?

(b) Suppose we were to change the constant term to 4 — 107°. What are the
zeros of this new function? Relative to the size of the change in the constant
term, how big is the change in the zeros of the function?

(c) Now, suppose we were to change the constant term to 4 + 107°. What are
the zeros of this new function? Relative to the size of the change in the
constant term, how big is the change in the zeros of the function?

16. Consider the linear, first-order differential equation


(a) Solve this equation subject to the initial condition $x(\pi/2) = x_0$.

(b) Solve this equation subject to the perturbed initial condition $x(\pi/2) =
x_0 + \epsilon$.

(c) By considering the difference between the solutions obtained in parts (a)
and (b), comment on the conditioning of this problem.

17. Consider the linear, first-order differential equation

< = zt = tsint.

(a) Solve this equation subject to the initial condition $x(\pi/2) = x_0$.

(b) Solve this equation subject to the perturbed initial condition $x(\pi/2) =
x_0 + \epsilon$.

(c) By considering the difference between the solutions obtained in parts (a)
and (b), comment on the conditioning of this problem.

18. Consider the linear system of equations

FAY 2200 z
[yaa] []-"

(a) Solve the system for the right-hand side vector $b = [\,3.2 \;\; 5.8\,]^T$.
(b) Solve the system for the right-hand side vector $b = [\,3.21 \;\; 5.79\,]^T$.
(c) Solve the system for the right-hand side vector $b = [\,3.1 \;\; 5.7\,]^T$.

(d) By considering the difference between the solutions obtained in parts (a),
(b), and (c), comment on the conditioning of this problem.


1.4 MATHEMATICS ON THE COMPUTER: FLOATING POINT ARITHMETIC


The objective of numerical analysis is not just to input numbers to the computer, 
but rather to perform a sequence of calculations to produce a desired result. As 
each floating point operation is performed, a new amount of roundoff error will be 
introduced. It is therefore reasonable to expect a slow accumulation of roundoff
error as calculations proceed. If positive and negative deviations are randomly
distributed, it is possible for roundoff error to remain level or even to decrease. The
opposite extreme is, unfortunately, also possible. Certain mathematical operations, 
when performed in finite precision, can lead to a dramatic accumulation of roundoff. 
A theoretical and experimental investigation of this issue will be presented in this 
section. 


Floating Point Arithmetic 


Since computers have no concept of the real numbers, they also have no concept of
real arithmetic. Instead, computers perform calculations within their floating point
number system. The scheme for floating point arithmetic that is adopted below
glosses over many of the precise implementation details, which would of course be
machine dependent, but is sufficient for illustration purposes.

Let $x$ and $y$ be any real numbers, and let $\odot$ denote one of the binary arithmetic
operators of addition (+), subtraction (−), multiplication (×), or division (÷). The
corresponding floating point equivalent operator, $\odot_{fl}$, is defined by the relation

$$x \odot_{fl} y = fl(fl(x) \odot fl(y)). \qquad (1)$$


In other words, floating point arithmetic will be assumed to consist of three steps. 
First, each operand is replaced by its floating point equivalent. Next, the exact 
arithmetic is performed. Finally, the result is replaced by its floating point equiva- 
lent. For example, using 4 decimal digit rounding arithmetic, 


$$\begin{aligned} (5/3) \times_{fl} \sqrt{3} &= fl(fl(5/3) \times fl(\sqrt{3})) \\ &= fl(1.667 \times 1.732) \\ &= fl(2.887244) \\ &= 2.887. \end{aligned}$$


Although this scheme glosses over any implementation specific details, the IEEE 
standard requires that relation (1) hold provided that neither underflow nor overflow 
occurs, so it is completely realistic. 
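
Relation (1) can be mimicked in software. The following is a minimal sketch of 4 decimal digit rounding arithmetic using Python's standard `decimal` module. The names `FOUR_DIGIT`, `fl` and `fl_op` are ours, and the "exact" intermediate arithmetic is really 28-digit decimal arithmetic, which is an assumption that is harmless for operands of this size.

```python
import operator
from decimal import Context, Decimal, ROUND_HALF_UP

# A toy 4-digit decimal machine; exponent limits are left at their defaults.
FOUR_DIGIT = Context(prec=4, rounding=ROUND_HALF_UP)


def fl(x, ctx=FOUR_DIGIT):
    """Round x to the precision of ctx (the map x -> fl(x))."""
    return ctx.plus(Decimal(x))


def fl_op(op, x, y, ctx=FOUR_DIGIT):
    """Relation (1): round both operands, operate 'exactly', round the result."""
    return fl(op(fl(x, ctx), fl(y, ctx)), ctx)


x = Decimal(5) / Decimal(3)           # 1.666... to 28 digits
y = Decimal(3).sqrt()                 # 1.732... to 28 digits
print(fl(x), fl(y))                   # 1.667 1.732
print(fl_op(operator.mul, x, y))      # 2.887
```

The other three operations are obtained by passing `operator.add`, `operator.sub` or `operator.truediv` in place of `operator.mul`.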

Before turning to a discussion of roundoff error accumulation, it is important 
to note that floating point arithmetic does not satisfy many of the properties of 
real arithmetic that are taken for granted. When considered as real quantities, 
the values 0.1329, 1.543, and 23.21 can be added in any order to obtain the result 
24.8859. However, in 4 decimal digit rounding arithmetic, 


(0.1329 + 1.543) + 23.21 = 1.676 + 23.21 = 24.89,

but

0.1329 + (1.543 + 23.21) = 0.1329 + 24.75 = 24.88.


Note that all intermediate results have been rounded. Because the two final values
are not equal, we see that floating point arithmetic is not associative. Note that the more
accurate result is obtained by adding the values in ascending order. For values of the
same sign, this ordering generally produces the more accurate result. The reasoning
behind this is straightforward. By starting with the smallest values, information in
the least significant digits can accumulate and influence the final result. When the
numbers above were added in descending order, the two least significant digits in
0.1329 were simply discarded.

The distributive laws also do not hold in floating point arithmetic. Take, for 
example, 

(0.1351 + 23.21) x 1.543 = 23.35 x 1.543 = 36.03. 


When computed as the sum of two products, the result is 
0.1351 x 1.543 + 23.21 x 1.543 = 0.2085 + 35.81 = 36.02. 


The exact value in this case is 36.0214893. 
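
Both failures can be reproduced on a simulated 4-digit decimal machine. The sketch below uses a `decimal.Context` with four digits of precision, so that every call to `ctx.add` or `ctx.multiply` rounds its result to four significant digits; the variable names are ours, and the operands are the ones used in the text.

```python
from decimal import Context, Decimal, ROUND_HALF_UP

ctx = Context(prec=4, rounding=ROUND_HALF_UP)  # 4 decimal digit rounding

a, b, c = Decimal("0.1329"), Decimal("1.543"), Decimal("23.21")

# Associativity: the grouping of the additions changes the rounded result.
ascending = ctx.add(ctx.add(a, b), c)
descending = ctx.add(a, ctx.add(b, c))
print(ascending, descending)                   # 24.89 24.88

# Distributivity: factored and expanded forms round differently.
p = Decimal("0.1351")
factored = ctx.multiply(ctx.add(p, c), b)
expanded = ctx.add(ctx.multiply(p, b), ctx.multiply(c, b))
print(factored, expanded)                      # 36.03 36.02
print((p + c) * b)                             # 36.0214893, the exact value
```

In the second computation it is the expanded form that lands closer to the exact value, a reminder that no single arrangement of a formula is best in every case.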


Accumulation of Roundoff Error 


Now that a framework for performing floating point operations has been established,
we are in a position to discuss the mechanisms by which roundoff error accumulates
in floating point operations. Again, let $x$ and $y$ be any real numbers, let $\odot$ denote an
exact binary operator and let $\odot_{fl}$ denote the corresponding floating point operator.
The difference between the exact value $x \odot y$ and the floating point value $x \odot_{fl} y$ can
be broken into two components:

$$\begin{aligned} x \odot_{fl} y - x \odot y &= fl(fl(x) \odot fl(y)) - x \odot y \\ &= [fl(fl(x) \odot fl(y)) - fl(x) \odot fl(y)] + [fl(x) \odot fl(y) - x \odot y]. \end{aligned}$$

The first term, $fl(fl(x) \odot fl(y)) - fl(x) \odot fl(y)$, is known as the introduced error.
It represents the roundoff error associated with the final step in the operation
(i.e., replacing the value $fl(x) \odot fl(y)$ by its floating point equivalent). The second
component, $fl(x) \odot fl(y) - x \odot y$, measures the effect that the roundoff error in the
two operands has on the result of the exact arithmetic operation. This component is
referred to as the propagated error. Whereas the introduced error is small, bounded
in relative terms by machine precision, the propagated error can be large.

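To examine the propagated error, let $\delta_x$ denote the absolute deviation and $\epsilon_x$
denote the relative deviation between $x$ and $fl(x)$; that is,

$$\delta_x = fl(x) - x \quad \text{and} \quad \epsilon_x = \frac{fl(x) - x}{x} = \frac{\delta_x}{x}.$$

Note that from these definitions, $fl(x)$ can be written as either $x + \delta_x$ or $x + x\epsilon_x =
x(1 + \epsilon_x)$. Similar definitions apply for $\delta_y$ and $\epsilon_y$, and similar expressions can be
used to express $fl(y)$.

Starting with multiplication, it is found that

$$\begin{aligned} fl(x) \times fl(y) &= [x(1 + \epsilon_x)] \times [y(1 + \epsilon_y)] \\ &= xy(1 + \epsilon_x + \epsilon_y + \epsilon_x \epsilon_y) \qquad (2) \\ &= xy(1 + \epsilon_{xy}), \end{aligned}$$

where $\epsilon_{xy} = \epsilon_x + \epsilon_y + \epsilon_x \epsilon_y \approx \epsilon_x + \epsilon_y$ provided $|\epsilon_x|, |\epsilon_y| \ll 1$. For division

$$\begin{aligned} fl(x) \div fl(y) &= [x(1 + \epsilon_x)] \div [y(1 + \epsilon_y)] \\ &= \frac{x}{y}(1 + \epsilon_x)(1 - \epsilon_y + \epsilon_y^2 - \cdots) \qquad (3) \\ &= \frac{x}{y}(1 + \epsilon_{x/y}), \end{aligned}$$

where $\epsilon_{x/y} \approx \epsilon_x - \epsilon_y$. The geometric series expansion

$$(1 + \epsilon_y)^{-1} = 1 - \epsilon_y + \epsilon_y^2 - \epsilon_y^3 + \cdots$$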

was used in the second line of (3). From expressions (2) and (3), it is seen that 
with multiplication and division, the relative error propagates slowly in the sense 
that, if the relative error in the operands is small, the relative error in the product 
or quotient will also be small. It should be noted, though, that absolute error can 
grow rapidly when multiplying by a large number or dividing by a small number. 
Whenever possible, algorithms should therefore be written to avoid these situations. 

When working with addition and subtraction, the behavior of absolute and 
relative error is very different from that associated with multiplication and division. 
Consider 


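$$\begin{aligned} fl(x) \pm fl(y) &= [x + \delta_x] \pm [y + \delta_y] \\ &= (x \pm y) + (\delta_x \pm \delta_y) \\ &= (x \pm y)\left(1 + \frac{x\epsilon_x \pm y\epsilon_y}{x \pm y}\right). \end{aligned} \qquad (4)$$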


From the second line of (4), it is clear that if the absolute error in the operands 
is small, the absolute error in the result will also be small. However, even when 
the relative error in the operands is small, the relative error in the result,
$(x\epsilon_x \pm y\epsilon_y)/(x \pm y)$, can still be large if $x \pm y$ is close to zero. This will happen when two nearly equal
numbers are subtracted or when numbers of nearly equal magnitude but different 
sign are added. Drastic propagation of error due to this situation is known as 
cancellation error. This type of error is directly linked to the loss of significant 
digits in the calculation. Algorithms should therefore also avoid the subtraction of 
nearly equal numbers whenever possible. 
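
The contrast between expressions (2) and (4) can be checked numerically. The following is a minimal sketch, assuming a 4 decimal digit rounding system simulated with Python's `decimal` module; the operands `x` and `y` and the helper names `fl` and `rel_err` are illustrative choices, not values taken from the text.

```python
from decimal import Decimal, Context, ROUND_HALF_UP

# Simulated 4 decimal digit rounding arithmetic (an assumption for illustration).
FOUR = Context(prec=4, rounding=ROUND_HALF_UP)

def fl(value):
    """Round a Decimal to the 4-digit system."""
    return FOUR.create_decimal(value)

def rel_err(approx, exact):
    """Relative error, evaluated in the default (28-digit) context."""
    return abs((approx - exact) / exact)

# Two nearly equal operands (hypothetical values).
x, y = Decimal("1.23456"), Decimal("1.23411")
fx, fy = fl(x), fl(y)

prod = FOUR.multiply(fx, fy)   # fl(fl(x) * fl(y))
diff = FOUR.subtract(fx, fy)   # fl(fl(x) - fl(y))

print("operand relative errors:", rel_err(fx, x), rel_err(fy, y))
print("product relative error :", rel_err(prod, x * y))
print("difference relative err:", rel_err(diff, x - y))
```

The operands and the product carry relative errors well below $10^{-3}$, while the computed difference 0.001 is off by more than 100% from the exact difference 0.00045.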

We now give three examples in which cancellation error can arise if care is 
not exercised. 

EXAMPLE 1.10 The Quadratic Formula


A common mathematical procedure is the solution of a quadratic equation using
the quadratic formula:

$$\frac{-b \pm \sqrt{b^2 - 4ac}}{2a}.$$

In most circumstances, this formula will produce entirely reasonable and accurate
results. However, whenever the quantity $b^2$ is much, much greater than the product
$4ac$, the discriminant will have a value very close to $b^2$. The numerator in
this case will then be roughly equal to $-b \pm |b|$. Cancellation error will then occur
either in the calculation of $-b + \sqrt{b^2 - 4ac}$ or in the calculation of $-b - \sqrt{b^2 - 4ac}$,
depending on the sign of the parameter $b$.

As an illustration, consider the quadratic equation $0.2x^2 - 47.91x + 6 = 0$. To
ten digits, the roots of this equation are 0.1253003555 and 239.4246996. Let's now
calculate the roots using 4 decimal digit rounding arithmetic. The computation
proceeds as follows:

$$\frac{47.91 \pm \sqrt{47.91^2 - 4 \cdot 0.2 \cdot 6}}{2 \cdot 0.2} = \frac{47.91 \pm \sqrt{2295 - 4.8}}{0.4} = \frac{47.91 \pm \sqrt{2290}}{0.4} = \frac{47.91 \pm 47.85}{0.4}.$$

One root of the quadratic is thus computed to be $(47.91 + 47.85)/0.4 = 95.76/0.4 = 239.4$, while
the second is computed to be $(47.91 - 47.85)/0.4 = 0.06/0.4 = 0.15$. This problem encompasses
both extremes of roundoff error accumulation. Despite individual roundoff errors
in the computation of $b^2$, the difference between $b^2$ and $4ac$, and the square root of
the discriminant, the larger root is correct to all digits shown: 239.4 is the floating
point equivalent of 239.4246996 in a 4 decimal digit number system with rounding.
The smaller root, on the other hand, is in error by nearly 20%. The subtraction of
the nearly equal numbers 47.91 and 47.85 resulted in the loss of three significant
digits (from four in each of the operands to one in the result), which, in turn,
produced a large relative error.

Is there some way to reformulate the calculation of the smaller root so as to
obtain a more accurate result? The answer is yes. Because $b = -47.91 < 0$ for this
problem, cancellation error occurred for the

$$\frac{-b - \sqrt{b^2 - 4ac}}{2a}$$

portion of the quadratic formula. Rationalizing the numerator of this expression
yields

$$\frac{-b - \sqrt{b^2 - 4ac}}{2a} = \frac{2c}{-b + \sqrt{b^2 - 4ac}}$$

as an alternative formula for calculating the smaller root. Note that in the
denominator of this new formula, two nearly equal numbers are added, rather than
subtracted, so the possibility of cancellation error has been eliminated. Substituting
the coefficient values into the new formula produces

$$\frac{2 \cdot 6}{47.91 + 47.85} = \frac{12}{95.76} = 0.1253,$$

which is the floating point equivalent of 0.1253003555 in a 4 decimal digit number
system with rounding.
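
The hand computation above can be replayed in code. This is a sketch only, assuming that Python's `decimal` module with 4 significant digits and half-up rounding is an acceptable stand-in for the 4 decimal digit rounding arithmetic of the example; the function name `roots_4digit` is hypothetical.

```python
from decimal import Decimal, Context, ROUND_HALF_UP, localcontext

def roots_4digit(a, b, c):
    """Return (naive_roots, reformulated_roots) in simulated 4-digit arithmetic."""
    with localcontext(Context(prec=4, rounding=ROUND_HALF_UP)):
        a, b, c = Decimal(a), Decimal(b), Decimal(c)
        disc = (b * b - 4 * a * c).sqrt()          # each operation rounds to 4 digits

        # Textbook quadratic formula: one of the two sums suffers cancellation.
        naive = ((-b + disc) / (2 * a), (-b - disc) / (2 * a))

        # Choose the sign that ADDS magnitudes, then recover the other root
        # from the rationalized expression 2c / (-b +/- sqrt(b^2 - 4ac)).
        q = -b + disc if b < 0 else -b - disc
        reformulated = (q / (2 * a), 2 * c / q)
    return naive, reformulated

naive, reformulated = roots_4digit("0.2", "-47.91", "6")
print("naive       :", naive)          # smaller root comes out as 0.15
print("reformulated:", reformulated)   # smaller root comes out as 0.1253
```

Selecting the sign from the sign of $b$ is what lets the same routine handle both $b < 0$ (as here) and $b > 0$.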


EXAMPLE 1.11 A Linear System of Equations 


Consider the solution of the linear system of three equations in three unknowns,
given in augmented matrix form by

$$\left[\begin{array}{rrr|r} 6 & -2 & 3 & 5 \\ 1 & -1/3 & 1/3 & 2 \\ 1 & 3 & -1 & 5 \end{array}\right].$$

One pass of Gaussian elimination using exact arithmetic produces the matrix

$$\left[\begin{array}{rrr|r} 6 & -2 & 3 & 5 \\ 0 & 0 & -1/6 & 7/6 \\ 0 & 10/3 & -3/2 & 25/6 \end{array}\right],$$

from which the exact solution is computed to be $x_1 = 3.7$, $x_2 = -1.9$, and $x_3 = -7$.
To arrive at these values, interchange the second and third rows of the reduced
matrix and perform back substitution.

To examine the effect of working in 4 decimal digit rounding arithmetic, first
express the original matrix in floating point:

$$\left[\begin{array}{rrr|r} 6 & -2 & 3 & 5 \\ 1 & -0.3333 & 0.3333 & 2 \\ 1 & 3 & -1 & 5 \end{array}\right].$$

For the first pass of Gaussian elimination, the pivot is placed in the first row, first
column, and the multiplier needed to eliminate the 1's in the first column of the
second and third rows is $fl(-1/6) = -0.1667$. The computations for the non-zero
entries in the new second row produce

    -0.3333 + (-2) x (-0.1667) = -0.3333 + 0.3334 = 0.0001;
     0.3333 + (3) x (-0.1667) = 0.3333 - 0.5001 = -0.1668; and
     2 + (5) x (-0.1667) = 2 - 0.8335 = 1.167.

Note the loss of significant figures as the result of the subtraction of nearly equal
numbers in the computation of the element in row 2, column 2. The computations
for the new third row produce

     3 + (-2) x (-0.1667) = 3 + 0.3334 = 3.333;
    -1 + (3) x (-0.1667) = -1 - 0.5001 = -1.500; and
     5 + (5) x (-0.1667) = 5 - 0.8335 = 4.167.

The matrix for the next pass of Gaussian elimination is then

$$\left[\begin{array}{rrr|r} 6 & -2 & 3 & 5 \\ 0 & 0.0001 & -0.1668 & 1.167 \\ 0 & 3.333 & -1.500 & 4.167 \end{array}\right].$$

The multiplier needed to eliminate the entry in the second column of the third row
is $fl(-3.333/0.0001) = -33330$, resulting in the computations

    -1.500 + (-0.1668) x (-33330) = -1.500 + 5559 = 5558; and
     4.167 + (1.167) x (-33330) = 4.167 - 38900 = -38900.


A cascade of effects has occurred here. Cancellation error led to a small pivot 
element, which then led to a large multiplier. The use of the large multiplier then 
resulted in the subtraction of numbers with drastically different magnitudes with 
a concomitant loss of significant digits. For example, the value 4.167 originally in 
the last row, last column of the matrix had no effect on the calculation of the new 
element at that location. 

The final reduced matrix now takes the form

$$\left[\begin{array}{rrr|r} 6 & -2 & 3 & 5 \\ 0 & 0.0001 & -0.1668 & 1.167 \\ 0 & 0 & 5558 & -38900 \end{array}\right].$$

From the last row, it follows that $x_3 = -6.999$, which is a pretty good approximation
for the exact value of $x_3$. Substituting $x_3 = -6.999$ into the second row generates
the equation $0.0001x_2 - 0.1668(-6.999) = 1.167$, which is equivalent to $0.0001x_2 + 1.167 = 1.167$
in 4 decimal digit rounding arithmetic. The solution of this last
equation is $x_2 = 0.0000$. Finally, substituting the values for $x_2$ and $x_3$ into the first
row gives the equation $6x_1 + 3(-6.999) = 5$, whose solution is $x_1 = 4.333$. The error
in $x_2$ is 100%, while that in $x_1$ is slightly more than 17%. These large errors can be
directly traced to the loss of significant figures caused ultimately by the subtraction
of nearly equal numbers.

One possible strategy for alleviating the accumulation of roundoff error during 
Gaussian elimination will be explored in the exercises. More will be said on this 
subject in Chapter 3. 
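
The elimination in this example can be reproduced mechanically. What follows is a sketch under stated assumptions: 4 decimal digit rounding is simulated with Python's `decimal` module (half-up rounding after every operation), no row interchanges are performed, and the names `gauss_no_pivot_4digit` and `AUGMENTED` are illustrative.

```python
from decimal import Decimal, Context, ROUND_HALF_UP, localcontext

# Augmented matrix of Example 1.11, already expressed in 4-digit floating point.
AUGMENTED = [
    ["6", "-2", "3", "5"],
    ["1", "-0.3333", "0.3333", "2"],
    ["1", "3", "-1", "5"],
]

def gauss_no_pivot_4digit(aug):
    """Gaussian elimination without pivoting; every operation rounds to 4 digits."""
    with localcontext(Context(prec=4, rounding=ROUND_HALF_UP)):
        A = [[+Decimal(v) for v in row] for row in aug]   # unary + rounds the input
        n = len(A)
        for i in range(n - 1):                  # elimination passes
            for r in range(i + 1, n):
                m = -A[r][i] / A[i][i]          # multiplier, e.g. fl(-1/6) = -0.1667
                A[r][i] = Decimal(0)
                for c in range(i + 1, n + 1):
                    A[r][c] = A[r][c] + m * A[i][c]
        x = [Decimal(0)] * n
        for i in reversed(range(n)):            # back substitution
            s = A[i][n]
            for c in range(i + 1, n):
                s = s - A[i][c] * x[c]
            x[i] = s / A[i][i]
    return x

x_computed = gauss_no_pivot_4digit(AUGMENTED)
print(x_computed)   # compare with the exact solution (3.7, -1.9, -7)
```

Under these assumptions the routine returns the same values as the hand computation, $x_1 = 4.333$, $x_2 = 0$, and $x_3 = -6.999$.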

EXAMPLE 1.12 Values of a Function


Even the seemingly straightforward task of evaluating a function can prove difficult.
Suppose we need to evaluate the function $f(x) = e^x - \cos x - x$ for values of $x$
very near zero. Because both $e^x$ and $\cos x$ approach one as $x$ approaches zero, it
is possible that cancellation error will be a problem. Before we start calculating
function values, however, let's do some analysis to determine what we should expect.

To start, we find that at $x = 0$, $f(0) = 0$. Next, we turn to the first derivative
of $f$, $f'(x) = e^x + \sin x - 1$. For $0 < x < \pi$, $e^x > 1$ and $\sin x > 0$, so $f'(x) > 0$ and
$f$ is a strictly increasing function. On the other hand, for $-\pi < x < 0$, $e^x < 1$ and
$\sin x < 0$, so $f'(x) < 0$ and $f$ is a strictly decreasing function. Therefore, $x = 0$ is
the only zero of $f$ on the interval $(-\pi, \pi)$. Lastly, examine the second derivative,
$f''(x) = e^x + \cos x$. For $-\pi/2 < x < \pi/2$, $e^x > 0$ and $\cos x > 0$, so $f''(x) > 0$ and
$f$ is concave up.

Now let's calculate some function values. Figure 1.4 plots $f$ over the interval
$-5 \times 10^{-8} < x < 5 \times 10^{-8}$. Points were generated at 1001 uniformly spaced
abscissas, and all calculations were performed in IEEE standard double precision.
The overall trend in the graph is in agreement with our analysis, but the fine detail
clearly is not. For instance, the graph suggests several zeros closely spaced around
$x = 0$ as opposed to a unique zero at $x = 0$.

Is there some way to reformulate this problem so as to more accurately reproduce
the fine details of $f$? Given that the objective is to evaluate $f$ near $x = 0$,
it seems natural to try replacing both $e^x$ and $\cos x$ by their respective Taylor series
about $x = 0$. This yields

$$e^x - \cos x - x = x^2 + \frac{x^3}{6} + \cdots.$$

It is left to the exercises to show that the relative error incurred by using $x^2 + x^3/6$
to approximate $e^x - \cos x - x$ is roughly $10^{-24}$ for $|x| < 5 \times 10^{-8}$. Hence, for
$|x| < 5 \times 10^{-8}$, $e^x - \cos x - x = x^2 + x^3/6$ to machine precision in IEEE standard
double precision. Furthermore, by rearranging the polynomial as

$$x^2 + \frac{x^3}{6} = x^2\left(1 + \frac{x}{6}\right),$$

we see that calculations with $x$ near zero will not involve the subtraction of nearly
equal numbers, so there should not be any problem with cancellation error. The
plot in Figure 1.5 confirms this assessment.

Figure 1.4 Graph of the function $f(x) = e^x - \cos x - x$ computed in
IEEE standard double precision.

Figure 1.5 Graph of the function $f(x) = e^x - \cos x - x$ computed from
the reformulated expression $f(x) = x^2 + x^3/6$ in IEEE standard double
precision.
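
The two ways of evaluating $f$ can be compared directly. The sketch below assumes standard IEEE double precision as provided by Python floats; the function names `f_naive` and `f_taylor` and the way the 1001 abscissas are generated are illustrative choices.

```python
import math

def f_naive(x):
    """Direct evaluation; subtracts nearly equal numbers when x is near zero."""
    return math.exp(x) - math.cos(x) - x

def f_taylor(x):
    """Reformulated evaluation x^2 (1 + x/6), intended only for |x| < 5e-8."""
    return x * x * (1.0 + x / 6.0)

# 1001 uniformly spaced abscissas on [-5e-8, 5e-8], with x = 0 at the center.
xs = [(i - 500) * 1e-10 for i in range(1001)]

# Away from x = 0 the exact f is strictly positive, so any non-positive
# value returned by the direct formula is purely a roundoff artifact.
artifacts = sum(1 for x in xs if x != 0.0 and f_naive(x) <= 0.0)
print("non-positive values from direct formula:", artifacts)
print("f_naive(1e-8)  =", f_naive(1e-8))
print("f_taylor(1e-8) =", f_taylor(1e-8))
```

The reformulated values stay positive and vary smoothly across the whole grid, which is the behavior the analysis of $f'$ and $f''$ predicts.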

Just Add More Precision


A common suggestion for avoiding roundoff error problems is to just use higher
precision in all calculations. For example, when programming in FORTRAN, change
variable declarations from REAL*8 to REAL*16. When programming in C/C++,
change variable declarations from float to double. This procedure does not always
work, however.

Consider the following problem introduced by Rump [1] and reconsidered by
Aberth [2]. Let $a = 77617.0$ and $b = 33096.0$ and compute

$$333.75b^6 + a^2(11a^2b^2 - b^6 - 121b^4 - 2) + 5.5b^8 + \frac{a}{2b}.$$


Rump reports that on an IBM System 370 mainframe, using FORTRAN, the
following results were obtained:

    single precision      +1.172603...
    double precision      +1.1726039400531...
    extended precision    +1.172603940053178...

Although it would be tempting based on these values to claim that, to 7 digits,
the value of this expression is +1.172603, the true value to 15 decimal places is
-0.827396059946821. The FORTRAN values do not even generate the correct
sign. In FORTRAN 77 on a Sun SPARCclassic workstation, the values

    single precision      -6.33825 x 10^29
    double precision      -1.180591627174 x 10^21
    extended precision    +1.172603940053178...

were obtained. The single and double precision values have the correct sign but
have dramatically different magnitude than the true value. The extended precision
value agrees with that obtained by Rump. Finally, the Digits parameter in MAPLE
Release 5 had to be set to at least 37 to obtain the true value.
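
The experiment is easy to repeat. The following sketch assumes Python, whose `float` type is IEEE double precision and whose `fractions.Fraction` type performs exact rational arithmetic; the function name `rump` and its parameter names are hypothetical. The particular double precision value printed depends on the order of operations, so none is quoted here.

```python
from fractions import Fraction

def rump(a, b, c6, c8):
    """Rump's expression; c6 = 333.75 and c8 = 5.5 are passed in so that the
    same code runs in floating point and in exact rational arithmetic."""
    return (c6 * b**6
            + a**2 * (11 * a**2 * b**2 - b**6 - 121 * b**4 - 2)
            + c8 * b**8
            + a / (2 * b))

# IEEE double precision evaluation.
double_value = rump(77617.0, 33096.0, 333.75, 5.5)

# Exact rational evaluation: no roundoff at any step.
exact_value = rump(Fraction(77617), Fraction(33096),
                   Fraction("333.75"), Fraction("5.5"))

print("double precision:", double_value)
print("exact           :", exact_value, "=", float(exact_value))
```

In exact arithmetic the polynomial part collapses to $-2$ because $a^2 = 5.5b^2 + 1$ for these inputs, leaving $a/(2b) - 2 = -54767/66192 \approx -0.8274$. The terms that must cancel are of order $10^{36}$, far beyond what 16 significant digits can resolve.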


References 


1. S. M. Rump, "Algorithms for Verified Inclusions: Theory and Practice," in
Reliability in Computing, R. E. Moore, ed., Academic Press, San Diego, 1988.

2. O. Aberth, Precise Numerical Methods Using C++, Academic Press, San Diego,
1998.


EXERCISES 
1. Determine the value of each of the following expressions using 4-digit rounding 
and 4-digit chopping arithmetic. For each quantity, compute the absolute and 
the relative error. 
(a) +e cos 22° (b) e/7+ V2inz 
(c) m1n2+ V10cos 22° (d) (In2 — /10 + tan 22°) / (7/9) 
2. Identify the potential roundoff error problems in the following algorithm for
calculating the roots of the quadratic equation $ax^2 + bx + c = 0$.

    GIVEN:  real coefficients a, b, c
    STEP 1: calculate disc = sqrt(b^2 - 4ac)
    STEP 2: calculate root1 = (-b + disc)/(2a)
    STEP 3: calculate root2 = c/(a · root1)
    OUTPUT: root1 and root2

Note that this algorithm uses the fact that the product of the roots of
$ax^2 + bx + c = 0$ is equal to $c/a$.

3. Identify the potential roundoff error problems in the following algorithm for
calculating the roots of the quadratic equation $ax^2 + bx + c = 0$.

    GIVEN:  real coefficients a, b, c
    STEP 1: calculate disc = sqrt(b^2 - 4ac)
    STEP 2: calculate root1 = -2c/(b + disc)
    STEP 3: calculate root2 = -(b/a) - root1
    OUTPUT: root1 and root2

Note that this algorithm uses the fact that the sum of the roots of
$ax^2 + bx + c = 0$ is equal to $-b/a$.

4. Construct an algorithm that computes the roots of the quadratic equation
$ax^2 + bx + c = 0$ and that avoids as many roundoff error problems as possible. Test your
algorithm by computing the roots of the quadratic equations $0.2x^2 - 47.91x + 6 = 0$
and $0.025x^2 + 7x - 0.1 = 0$. Use 4 decimal digit rounding arithmetic in your
calculations.


5. Show that the relative error incurred by using $x^2 + x^3/6$ to approximate
$e^x - \cos x - x$ is roughly $10^{-24}$ for $|x| < 5 \times 10^{-8}$.

6. In the floating point number system F(10, 10, -98, 100), subtract each of the
following pairs of numbers. How many significant decimal digits are lost in
performing the subtraction, and how does this compare with the number of
significant decimal digits to which the numbers agree?
(a) 355/113 and $\pi$
(b) 685/252 and $e$
(c) cos(0.1°) and cos(0.11°)
(d) 103/280 and $1/e$

7. (a) To how many significant decimal digits do the numbers $\sqrt{10002}$ and $\sqrt{10001}$
agree?
(b) In the floating point number system F(10, 10, -98, 100), subtract $\sqrt{10001}$
from $\sqrt{10002}$. How many significant decimal digits are lost in performing
the subtraction?
(c) Explain how you would rearrange your computations to obtain a more
accurate answer.

8. (a) For what values of $x$ does $(1 - \cos x)/x^2 = 1/2$ to full machine precision?
Consider both IEEE standard single precision and double precision. (Hint:
Use Taylor series.)
(b) Repeat part (a) to determine the values of $x$ for which $e^{-x} = 1$ to full
machine precision.
(c) Repeat part (a) to determine the positive values of $x$ for which
$\ln(1 + x) - \cos x - x = -1$ to full machine precision.

9. (a) Plot the function $f(x) = 1 - \cos x$ over the interval -5 x 107° < x < 5 x 107°.
Generate points at 1001 uniformly spaced abscissas and perform
all calculations in IEEE standard double precision.
(b) Reformulate $f$ to avoid cancellation error and then repeat part (a).

10. Repeat Exercise 9 for the function $f(x) = \tan^{-1} x - \sin x$.

11. Repeat Exercise 9 for the function $f(x) = \ln(1 + x) - \cos x - x + 1$ over the
interval -5 x 107° < x < 5 x 107°.

12. Near certain values of $x$ each of the following functions cannot be accurately
computed using the formula as given due to cancellation error. Identify the
values of $x$ which are involved (e.g., near $x = 0$ or large positive $x$) and propose
a reformulation of the function (e.g., using Taylor series, rationalization, trig
identities, etc.) to remedy the problem.
(a) $f(x) = 1 + \cos x$
(b) $f(x) = e^{-x} + \sin x - 1$
(c) $f(x) = \ln x - \ln(1/x)$
(d) $f(x) = \sqrt{x^2 + 1} - \sqrt{x^2 + 4}$
(e) $f(x) = 1 - 2\sin^2 x$
(f) $f(x) = \ln(x + \sqrt{x^2 + 1})$
(g) $f(x) = x - \sin x$
(h) $f(x) = \ln x - 1$

13. (a) Verify that
$$f(x) = 1 - \sin x \quad \text{and} \quad g(x) = \frac{\cos^2 x}{1 + \sin x}$$
are identical functions.
(b) Which function should be used for computations when $x$ is near $\pi/2$? Why?
(c) Which function should be used for computations when $x$ is near $3\pi/2$? Why?

14. It was noted that evaluation of the expression

$$333.75b^6 + a^2(11a^2b^2 - b^6 - 121b^4 - 2) + 5.5b^8 + \frac{a}{2b}$$

when $a = 77617.0$ and $b = 33096.0$ requires at least 37 decimal digits of precision.
(a) HP workstations have a double precision extended format that corresponds
to the floating point number system F(2, 113, -16381, 16384). Does this
system provide enough precision to evaluate the above expression?
(b) What is the smallest value for $k$ for which the floating point number system
F(2, k, m, M) provides 37 decimal digits of precision?

15. Consider the following linear system of equations:

$$\begin{bmatrix} 3.02 & -1.05 & 2.53 \\ 4.33 & 0.56 & -1.78 \\ -0.83 & -0.54 & 1.47 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} = \begin{bmatrix} -1.61 \\ 7.23 \\ -3.38 \end{bmatrix}.$$

(a) Determine the solution of this system using exact arithmetic during
Gaussian elimination.
(b) Determine the solution of this system using 3 decimal digit rounding
arithmetic during Gaussian elimination.
(c) Explain the difference between the answers found in part (a) and those found
in part (b).

16. One strategy for alleviating the accumulation of roundoff error during Gaussian
elimination is known as partial pivoting (a more detailed description of this
process will be provided in Chapter 3). The basic idea is as follows. During the
$i$th pass of Gaussian elimination, find the row, starting at row $i$ and running
through the last row of the matrix, that has the largest entry in column $i$.
Interchange this row with the current row $i$ and proceed with the elimination
phase.
(a) Repeat Exercise 15(b) using this partial pivoting strategy. What is the
relative error in each component of the computed solution?
(b) Repeat the Linear System of Equations problem considered in the text using
the partial pivoting strategy. What is the relative error in each component
of the computed solution?


CHAPTER 2 


Rootfinding 


AN OVERVIEW 
Fundamental Mathematical Problem 


In this chapter, several techniques will be developed for finding approximate solu- 
tions to the general mathematical problem 


given a function $f$, find a value for $x$ such that $f(x) = 0$.


Such an $x$ is called a zero of the function $f$ or a root of the equation $f(x) = 0$. This
problem is therefore known as the rootfinding problem. In the most general setting 
of the rootfinding problem, both the function f and the independent variable x 
could be vector valued. In this chapter, only the scalar case will be considered. 
Systems of nonlinear equations will be treated in Chapter 3. 

The “Solving a Crime” problem capsule in the Chapter 1 Overview (see 
page 2) is one application that gives rise to a rootfinding problem. Here are two 
more examples. 


van der Waals Equation 
Every student of high school chemistry has been exposed to the ideal gas law: 
PV =nRT, 


which relates the pressure (P), volume (V), and temperature (T) of an ideal gas. 
Here, n represents the number of moles of gas present, and R is the universal gas 
constant. Real gases satisfy this equation only approximately; under conditions 
of high pressure and/or low volume the approximation becomes more crude. One 
attempt to model the relationship among pressure, volume, and temperature for 
real gases is the van der Waals equation: 


$$\left(P + \frac{n^2 a}{V^2}\right)(V - nb) = nRT.$$

The term involving the parameter $a$ corrects the pressure for intermolecular
attractive forces (i.e., the pressure would be higher if not for the attractive forces
exerted among the molecules in the gas). The term involving the parameter $b$ is a
correction for that portion of the volume of the gas that is not compressible due
to the intrinsic volume of the gas molecules. Suppose that one mole of chlorine
gas has a pressure of 2 atmospheres and a temperature of 313 K. For chlorine gas,
$a = 6.29$ atm · liter²/mole² and $b = 0.0562$ liter/mole. What is the volume of the
gas?
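
To see how this question becomes a rootfinding problem, the van der Waals equation can be rewritten as $f(V) = 0$. The sketch below is illustrative only: the gas constant value $R = 0.08206$ liter · atm/(mole · K) is an assumption (the text does not quote one), and the names `vdw_residual`, `R_GAS`, and `V_ideal` are hypothetical.

```python
R_GAS = 0.08206  # liter*atm/(mole*K); assumed value, not given in the text

def vdw_residual(V, P=2.0, T=313.0, n=1.0, a=6.29, b=0.0562):
    """f(V) = (P + n^2 a / V^2)(V - n b) - n R T; the gas volume is a zero of f."""
    return (P + n**2 * a / V**2) * (V - n * b) - n * R_GAS * T

# The ideal gas law supplies a natural first estimate of the volume.
V_ideal = 1.0 * R_GAS * 313.0 / 2.0

print("ideal gas estimate   :", V_ideal)
print("f at V = 12 liters   :", vdw_residual(12.0))
print("f at ideal gas volume:", vdw_residual(V_ideal))
```

With these values $f$ is negative at $V = 12$ liters and positive at the ideal gas volume, so a zero of $f$ lies between them.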

Depth of Submersion


Suppose we want to determine how far a spherical object of radius $R$ will sink into
a fluid such as water or oil. According to Archimedes' principle, the object will
sink to the depth at which the weight of the fluid displaced by the object equals
the weight of the object. Now, the weight of the object is the product of its mass,
$m$, and the acceleration due to gravity, $g$. If we assume the object has a constant
mass density, $\rho_o$, then $m = \frac{4}{3}\pi R^3 \rho_o$ and

$$\text{weight of the object} = \frac{4}{3}\pi R^3 \rho_o g. \qquad (1)$$

[Diagram: cross section of the spherical object, bounded by the circle $x^2 + (y - R)^2 = R^2$.]

What about the weight of the displaced fluid? Assuming the fluid has density
$\rho_f$ and $V_d$ is the volume of fluid displaced by the object, then

$$\text{weight of displaced fluid} = \rho_f V_d g. \qquad (2)$$

To complete the specification of the problem, we need to determine $V_d$. Suppose
the object sinks to a depth $h$. Considering the geometry shown in the diagram
above and applying some basic calculus (volume by slicing to be exact), we find
that

$$V_d = \pi \int_0^h x^2 \, dy = \pi \int_0^h (2Ry - y^2) \, dy = \pi h^2 \left(R - \frac{h}{3}\right). \qquad (3)$$

Substituting (3) into (2) and equating the resulting expression with (1) yields

$$\frac{4}{3}\pi R^3 \rho_o g = \pi h^2 \left(R - \frac{h}{3}\right) \rho_f g.$$

After some algebraic simplification, this becomes

$$\frac{\rho_f}{3} h^3 - R \rho_f h^2 + \frac{4}{3} R^3 \rho_o = 0. \qquad (4)$$

Therefore, given values for $R$, $\rho_o$, and $\rho_f$, the depth to which the object sinks is
determined by solving equation (4) for $h$.
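
Equation (4) can be sanity checked before any rootfinding method is applied. This is a minimal sketch; the function name `depth_residual` and the sample radius and densities are illustrative assumptions.

```python
def depth_residual(h, R, rho_o, rho_f):
    """Left-hand side of equation (4); the submersion depth h is a zero."""
    return (rho_f / 3.0) * h**3 - R * rho_f * h**2 + (4.0 / 3.0) * R**3 * rho_o

# Check 1: an object with half the fluid's density should float exactly
# half submerged, so h = R must satisfy equation (4).
half_submerged = depth_residual(1.0, R=1.0, rho_o=0.5, rho_f=1.0)

# Check 2: for a floating object (rho_o < rho_f) the residual is positive at
# h = 0 and negative at h = 2R, so the physical root lies in (0, 2R).
at_surface = depth_residual(0.0, R=1.0, rho_o=0.6, rho_f=1.0)
fully_under = depth_residual(2.0, R=1.0, rho_o=0.6, rho_f=1.0)

print(half_submerged, at_surface, fully_under)
```

The sign change in the second check follows from (4) directly: at $h = 2R$ the left-hand side reduces to $\frac{4}{3}R^3(\rho_o - \rho_f)$.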

Multiplicity


Throughout this chapter, the concept of the multiplicity of a root will play an
important role.


Definition. A root $p$ of the equation $f(x) = 0$ is said to be a ROOT OF
MULTIPLICITY $m$ if $f$ can be written in the form

$$f(x) = (x - p)^m q(x),$$

where $\lim_{x \to p} q(x) \ne 0$. A root of multiplicity one is called a SIMPLE ROOT.


For polynomials, multiplicity can be determined by first factoring the polynomial
and then examining the power on each factor. For example, since

$$x^6 + x^5 - 12x^4 + 2x^3 + 41x^2 - 51x + 18 = (x - 1)^3 (x + 3)^2 (x - 2),$$

it follows that the equation

$$x^6 + x^5 - 12x^4 + 2x^3 + 41x^2 - 51x + 18 = 0$$

has a root of multiplicity 3 at $x = 1$, a root of multiplicity 2 at $x = -3$, and a
simple root at $x = 2$.
What about the equation $f(x) = 0$, where

$$f(x) = 2x + \ln\left(\frac{1 - x}{1 + x}\right)?$$


Clearly, f(0) = 0, so the equation has a root at x = 0. But what is the multiplicity 
of this root? For equations with non-polynomial functions, the following theorem 
is helpful. 


Theorem. Let $f$ be a continuous function with $m$ continuous derivatives.
The equation $f(x) = 0$ has a root of multiplicity $m$ at $x = p$ if and only if

$$f(p) = f'(p) = f''(p) = \cdots = f^{(m-1)}(p) = 0 \quad \text{but} \quad f^{(m)}(p) \ne 0.$$

Proof. Let $f$ be a continuous function with $m$ continuous derivatives. Suppose
that $f(p) = f'(p) = f''(p) = \cdots = f^{(m-1)}(p) = 0$ but $f^{(m)}(p) \ne 0$.
Expanding $f$ in a Taylor polynomial of degree $m - 1$ about the point $x = p$
yields

$$f(x) = \sum_{k=0}^{m-1} \frac{f^{(k)}(p)}{k!}(x - p)^k + \frac{f^{(m)}(\xi(x))}{m!}(x - p)^m,$$

where $\xi(x)$ is between $x$ and $p$. Using the hypotheses regarding the value of
$f^{(k)}(p)$ for $0 \le k \le m - 1$, this last expression simplifies to

$$f(x) = \frac{f^{(m)}(\xi(x))}{m!}(x - p)^m.$$

Let's examine the factor $f^{(m)}(\xi(x))$ more closely. Since $f^{(m)}$ is continuous, it
follows that

$$\lim_{x \to p} f^{(m)}(\xi(x)) = f^{(m)}\left(\lim_{x \to p} \xi(x)\right) = f^{(m)}(p) \ne 0.$$

Now, define $q(x) = f^{(m)}(\xi(x))/m!$. Then $f(x) = (x - p)^m q(x)$, where

$$\lim_{x \to p} q(x) \ne 0,$$

and the equation $f(x) = 0$ has a root of multiplicity $m$ at $x = p$.

Conversely, suppose that $f(x) = 0$ has a root of multiplicity $m$ at $x = p$.
Then there exists a function $q$, with $\lim_{x \to p} q(x) \ne 0$, such that
$f(x) = (x - p)^m q(x)$. By direct calculation, we find

$$f(p) = f'(p) = f''(p) = \cdots = f^{(m-1)}(p) = 0;$$

however, for the $m$th derivative, we find

$$f^{(m)}(p) = \lim_{x \to p} f^{(m)}(x) = m! \lim_{x \to p} q(x) \ne 0,$$

as desired. □


Returning to the problem posed before the theorem, for $f(x) = 2x + \ln\left(\frac{1 - x}{1 + x}\right)$, we
calculate

$$f(0) = f'(0) = f''(0) = 0, \quad \text{but} \quad f'''(0) = -4 \ne 0.$$

Hence, the equation $f(x) = 0$ has a root of multiplicity 3 at $x = 0$.
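
The derivative test in the theorem is straightforward to automate for polynomials with integer coefficients, where every evaluation is exact. The sketch below applies it to the degree six polynomial factored earlier in this section; the helper names `poly_eval`, `poly_derivative`, and `multiplicity` are hypothetical.

```python
def poly_eval(coeffs, x):
    """Evaluate a polynomial by Horner's rule; coeffs run from highest degree down."""
    result = 0
    for c in coeffs:
        result = result * x + c
    return result

def poly_derivative(coeffs):
    """Coefficients of the derivative, in the same highest-degree-first layout."""
    degree = len(coeffs) - 1
    return [c * (degree - k) for k, c in enumerate(coeffs[:-1])]

def multiplicity(coeffs, p):
    """Count how many of f(p), f'(p), f''(p), ... vanish before the first nonzero one."""
    m = 0
    while coeffs and poly_eval(coeffs, p) == 0:
        m += 1
        coeffs = poly_derivative(coeffs)
    return m

# x^6 + x^5 - 12x^4 + 2x^3 + 41x^2 - 51x + 18
P = [1, 1, -12, 2, 41, -51, 18]
for root in (1, -3, 2):
    print("x =", root, "has multiplicity", multiplicity(P, root))
```

Integer arithmetic is used on purpose: with floating point coefficients the test `== 0` would have to be replaced by a tolerance, and deciding on that tolerance is itself a roundoff question.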


Remainder of Chapter 


Basic techniques for solving the rootfinding problem fall into two categories: simple
enclosure methods and fixed point iteration schemes. Although the word iteration
appears in the name of only one of these categories, both classes of methods are,
in fact, iterative processes. The development of two simple enclosure methods, the
bisection method and the method of false position, will be the focus of Sections 1
and 2, respectively. These techniques will be guaranteed to converge to a root of
the specified function under very mild conditions; the rate of convergence will tend 
to be quite slow, however. The general theory of fixed point iteration schemes will 
be discussed in the next section. Properly constructed, such schemes will exhibit 
very rapid convergence; unfortunately, these techniques require stronger conditions 
to guarantee convergence. Section 4 will be devoted to the development of New- 
ton’s method, the classical fixed point iteration scheme. The secant method, which 
can be considered a variation on either Newton’s method or the method of false 
position, will be developed in Section 5. In Section 6, general techniques for accel- 
erating the convergence of iterative schemes will be developed. Special techniques 
for accelerating Newton’s method will also be presented. The chapter concludes 
with a section dealing with the special problem of locating roots of polynomials. 

A Remark about Pathological Examples


Although many powerful techniques for locating the roots of functions will be
developed in this chapter, it must be kept in mind that there are problems for which
it will be difficult, if not downright impossible, for even the best of techniques to
find the desired solution. Consider, for example, locating the roots of

$$f(x) = 3x^2 + \frac{1}{\pi^4} \ln[(\pi - x)^2] + 1.$$

It is fairly easy to establish that $f$ has two simple real roots. First, note that
since $3x^2 + 1$ is always positive, the term involving $\ln[(\pi - x)^2]$ must be negative
for $f$ to evaluate to zero. This implies that any roots must lie on the interval
$x \in (\pi - 1, \pi + 1)$. Combining the facts that $\lim_{x \to \pi} \ln[(\pi - x)^2] = -\infty$ and $f$ is
continuous everywhere except at $x = \pi$ with the knowledge that $f(\pi - 1) > 0$ and
$f(\pi + 1) > 0$ guarantees the existence of a root on each of the intervals $(\pi - 1, \pi)$ and
$(\pi, \pi + 1)$. Monotonicity of $f$ on $(-\infty, \pi)$ and on $(\pi, \infty)$ guarantees the uniqueness
of the root in each interval.

Since the natural logarithm term must balance $3x^2 + 1$ and the coefficient
on the logarithm term is roughly 0.01, it is reasonable to assume that both roots
are close to $\pi$. It follows that $\ln[(\pi - x)^2]$ must be roughly $-(3x^2 + 1)\pi^4 \big|_{x = \pi}$, or
$\approx -2981.6$. Therefore, $(\pi - x)$ must be on the order of $\pm e^{-1490.788} \approx \pm 10^{-647}$, so that
$x \approx \pi \pm 10^{-647}$. The floating point number system on most machines will never be
able to resolve these values. The moral of the story is a simple one: Pathological
problems do exist, so always perform some basic analysis before rushing to the
computer.
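
The claim that floating point cannot resolve these roots can be observed directly. The following sketch assumes IEEE double precision and Python 3.9 or later (for `math.nextafter`); the function name `f_pathological` is illustrative. It evaluates $f$ at the two doubles adjacent to the stored value of $\pi$.

```python
import math

def f_pathological(x):
    """f(x) = 3x^2 + ln[(pi - x)^2] / pi^4 + 1, undefined at x = math.pi itself."""
    return 3.0 * x**2 + math.log((math.pi - x) ** 2) / math.pi**4 + 1.0

# The closest representable neighbors of math.pi on either side.
left = math.nextafter(math.pi, 0.0)
right = math.nextafter(math.pi, 4.0)

print("spacing of doubles near pi:", right - math.pi)
print("f(left)  =", f_pathological(left))
print("f(right) =", f_pathological(right))
```

Doubles near $\pi$ are spaced about $4.4 \times 10^{-16}$ apart, so even at the nearest neighbors the logarithm term contributes less than 1 in magnitude against $3x^2 + 1 \approx 30.6$. The function is therefore positive at every double at which it can be evaluated near $\pi$, and no sign change is ever visible.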


2.1 THE BISECTION METHOD 


As noted in the overview to this chapter, rootfinding techniques are generally 
divided into two categories: simple enclosure methods and fixed point iteration 
schemes. All simple enclosure methods are based on the Intermediate Value Theo- 
rem. These methods essentially work by first finding an interval which is guaranteed 
to contain a root and then systematically shrinking the size of that interval. In this 
section, we will develop and study the performance of the most basic simple enclo- 
sure method, which is known as the bisection method. 


Intermediate Value Theorem 


Before we begin our development of the bisection method, let’s take a moment to 
review the Intermediate Value Theorem. This theorem appears in probably every 
calculus book; for a proof, consult a textbook on advanced calculus or real analysis. 


Theorem. Let $f$ be a continuous function over the closed interval $[a, b]$, and
let $k$ be any real number that lies between the values $f(a)$ and $f(b)$. Then
there exists a real number $c$ with $a < c < b$ such that $f(c) = k$.


In plain English, a function that is continuous on a closed interval is guaranteed to 
assume every value between the values achieved at the endpoints of the interval. 

So what does this have to do with the rootfinding problem? Basically, the
Intermediate Value Theorem provides a means for identifying intervals which enclose
the real zeros of continuous functions. All that is needed is to find an interval such 
that the values of the function at the endpoints of that interval are of opposite sign. 
The magnitudes of these endpoint values are irrelevant. As long as one endpoint 
value is positive and the other negative, zero is somewhere between the values, and 
at least one zero of the function is guaranteed to exist on that interval. 

To demonstrate this idea, consider the function $f(x) = x^3 + 2x^2 - 3x - 1$.
The value of this function at a string of consecutive integers is listed below. Each
change in the sign of the function value signals an interval that contains a real zero
of the function. This function, therefore, clearly has three simple real zeros, one
each on the intervals $(-3, -2)$, $(-1, 0)$, and $(1, 2)$.

    f(-3) = -1      f(-1) = 3       f(1) = -1
    f(-2) = 5       f(0) = -1       f(2) = 9


Bisection Method


Suppose we have used the Intermediate Value Theorem to locate an interval that
contains a zero of a continuous function. What do we do next? Our objective will
be to systematically shrink the size of that root enclosing interval. Perhaps the
simplest and most natural way to accomplish a reduction in interval size is to cut
the interval in half. Once this has been done, we determine which half contains a
root, by once again using the Intermediate Value Theorem, and then repeat the
process on that half. This technique is known as the bisection method.

From this very basic description of the bisection method, it should be clear
that the method generates a sequence of root enclosing intervals. For notational
convenience, let $(a_n, b_n)$ be the enclosing interval during the $n$th iteration of the
method. Furthermore, let $p_n$ denote the midpoint of the interval $[a_n, b_n]$; that is,

$$p_n = \frac{a_n + b_n}{2}.$$

We will use $p_n$ not only as one of the endpoints for the next enclosing interval, but
also as an approximation to the location of the exact root $p$. If $p_n$ is an accurate
enough approximation (an issue that will be addressed shortly), the iteration is
terminated; otherwise, the Intermediate Value Theorem is invoked to determine
which of the two subintervals, $(a_n, p_n)$ or $(p_n, b_n)$, contains the root and becomes
$(a_{n+1}, b_{n+1})$. The entire process is then repeated on that subinterval.


EXAMPLE 2.1 Bisection Method in Action 


We discovered earlier that the function f(z) = 2° + 2x* — 32 - 1 has a simple zero 
on the interval (1,2). Let’s run through a few iterations of the bisection method to 
demonstrate the general procedure. ; 

For the first iteration, we have (a;,6)) = (1,2) and we know that f(a;) <0 
and that f(b,) > 0. The midpoint of this first interval, and our first approximation 
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to the location of the exact root, is 


ath a aie | 
Ss ag Pa = 1.5. 


Pi 


To determine whether the root is contained on (a1,p1) = (1,1.5) or on (p11) = 
(1.5, 2), we calculate 


f(p1) = 2.375 > 0. 


Since f(a) and f(p;) are of opposite sign, the Intermediate Value Theorem tells us 
the root is between a; and p). For the next iteration, we therefore take (a2, bg) = 
(a1, pi) = (1, 1.5). 

The midpoint of this new interval, and our second approximation to the
location of the root, is

$$p_2 = \frac{a_2 + b_2}{2} = \frac{1 + 1.5}{2} = 1.25.$$

Note that

$$f(p_2) \approx 0.328 > 0,$$

which is of opposite sign from $f(a_2)$. Hence, the Intermediate Value Theorem tells
us the root is between $a_2$ and $p_2$, so we take $(a_3, b_3) = (a_2, p_2) = (1, 1.25)$.
In the third iteration, we then calculate

$$p_3 = \frac{a_3 + b_3}{2} = \frac{1 + 1.25}{2} = 1.125$$

and

$$f(p_3) \approx -0.420 < 0.$$

Here, we find that $f(a_3)$ and $f(p_3)$ are of the same sign, which implies that the root
must lie somewhere between $p_3$ and $b_3$. For the fourth iteration, we will therefore
have $(a_4, b_4) = (p_3, b_3) = (1.125, 1.25)$ and

$$p_4 = \frac{a_4 + b_4}{2} = 1.1875.$$

To ten decimal places, $p = 1.1986912435$, so the absolute error in $p_4$ is
$1.119 \times 10^{-2}$.
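The hand calculation above can be reproduced with a few lines of code. The following is a minimal sketch in Python; the names `f`, `approximations`, and `p_exact` are illustrative choices, not part of the text. It performs four bisection steps on $(1,2)$ and records each midpoint.

```python
def f(x):
    """Function from Example 2.1."""
    return x**3 + 2*x**2 - 3*x - 1

a, b = 1.0, 2.0          # starting interval (a_1, b_1)
approximations = []      # will hold p_1, p_2, p_3, p_4

for n in range(1, 5):
    p = (a + b) / 2      # midpoint of the current enclosing interval
    approximations.append(p)
    print(f"n = {n}: (a_n, b_n) = ({a}, {b}), p_n = {p}, f(p_n) = {f(p):.3f}")
    if f(a) * f(p) < 0:  # sign change on (a_n, p_n): keep the left subinterval
        b = p
    else:                # otherwise the root lies on (p_n, b_n)
        a = p

p_exact = 1.1986912435   # root to ten decimal places, as quoted in the text
print(f"absolute error in p_4: {abs(approximations[-1] - p_exact):.3e}")
```

For clarity this sketch re-evaluates $f(a_n)$ in every pass and tests the sign of a product of function values; it is meant only to mirror the hand calculation, not to be an efficient implementation.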


Even though we've developed the basic iterative process which lies at the
heart of the bisection method, we're not yet ready to construct an algorithm. Since
the bisection method is iterative in nature, the algorithm must contain a stopping
condition. We have to have some way to decide when $p_n$ is sufficiently accurate to
terminate the iteration. However, to properly formulate a stopping condition, we
need to understand the convergence properties of the sequence generated by the
bisection method. We will now undertake an analysis of these properties.


Convergence Analysis


Under what circumstances will the sequence of approximations generated by the 
bisection method converge to a root of $f(x) = 0$? When the sequence does converge,
what is the speed of convergence? Much of the information we need to answer these 
questions is contained in the following theorem. 


Theorem. Let $f$ be continuous on the closed interval $[a, b]$ and suppose that
$f(a) f(b) < 0$. The bisection method generates a sequence of approximations
$\{p_n\}$ which converges to a root $p \in (a,b)$ with the property

$$|p_n - p| \le \frac{b - a}{2^n}.$$


Notes 


1. Pay close attention to the conclusion of this theorem. It states that the
bisection method converges to a root of $f$, not the root of $f$. The condition
$f(a) f(b) < 0$ implies differing signs at the endpoints of the interval, which
guarantees the existence of a root, but not uniqueness. There may be more
than one root on the interval and there is no way to know, a priori, to which
root the sequence will converge, but it will converge to one of them.


2. Since $|p_n - p|$ is the absolute error in the approximation $p_n$, the expression on
the right-hand side of the inequality at the end of the theorem is referred to
as a theoretical error bound. The error at any stage of the iterative process
can never be larger than this quantity. Working with problems for which the
analytical solution is known and verifying that a theoretical error bound is
satisfied is a powerful tool for eliminating "human errors" in the development
of computer codes.

3. The requirement that an interval $[a, b]$ be found such that $f(a) f(b) < 0$ implies
that the bisection method cannot be used to locate roots of even multiplicity.
For such roots, the sign of the function does not change on either side of the
root. This restriction is, in fact, common to all simple enclosure techniques
and is not peculiar to the bisection method.


Proof of Theorem. Since the quantity $b - a$ is constant and $2^{-n} \to 0$ as
$n \to \infty$, establishing the error bound will be sufficient to prove convergence
of the bisection method sequence. By construction of the bisection algorithm
and using the notation introduced previously, for each $n$, $p \in (a_n, b_n)$ and $p_n$
is taken as the midpoint of $(a_n, b_n)$. This implies that $p_n$ can differ from $p$ by
no more than half the length of $(a_n, b_n)$; that is,

$$|p_n - p| \le \frac{1}{2}(b_n - a_n).$$

However, again by construction,

$$b_n - a_n = \frac{1}{2}(b_{n-1} - a_{n-1}) = \frac{1}{4}(b_{n-2} - a_{n-2}) = \cdots = \frac{1}{2^{n-1}}(b_1 - a_1).$$

Recalling that $b_1 = b$ and $a_1 = a$ and combining the last two equations
produces the desired error bound

$$|p_n - p| \le \frac{b - a}{2^n}.$$


So the sequence of approximations generated by the bisection method is always
guaranteed to converge to a root of the equation $f(x) = 0$. How quickly will the
sequence converge? From the theoretical error bound, we have

$$|p_n - p| \le \frac{b - a}{2^n} = (b - a)\frac{1}{2^n}.$$

Hence, if we take $\lambda = b - a$ and $\beta_n = 1/2^n$ in the definition of rate of convergence,
we see that the sequence generated by the bisection method has rate of convergence
$O(1/2^n)$.
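As suggested in the notes to the theorem, the theoretical error bound is easy to check numerically on a problem whose root is known. The sketch below (Python; the variable names are illustrative, and the reference root is the ten-decimal value quoted in the text) compares $|p_n - p|$ with $(b - a)/2^n$ for fifteen iterations on $f(x) = x^3 + 2x^2 - 3x - 1$ starting from $(1,2)$.

```python
def f(x):
    return x**3 + 2*x**2 - 3*x - 1

a0, b0 = 1.0, 2.0            # original interval (a, b)
p_exact = 1.1986912435       # root to ten decimal places, quoted in the text

a, b = a0, b0
bound_holds = []             # one True/False entry per iteration
for n in range(1, 16):
    p = (a + b) / 2
    error = abs(p - p_exact)
    bound = (b0 - a0) / 2**n  # theoretical error bound (b - a) / 2^n
    bound_holds.append(error <= bound)
    print(f"n = {n:2d}: error = {error:.10f}, bound = {bound:.10f}")
    if f(a) * f(p) < 0:
        b = p
    else:
        a = p

print("bound satisfied at every step:", all(bound_holds))
```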

What about order of convergence? Given that each time the enclosing interval
is cut in half we obtain an extra base 2 digit of accuracy, we might expect to find
that convergence is linear (i.e., $\alpha = 1$). Unfortunately, we run into a slight problem
when we try to apply the definition. Examine the last column of Table 2.1. This
table shows the results of fifteen iterations of the bisection method when applied to
the function $f(x) = x^3 + 2x^2 - 3x - 1$ with a starting interval of $(1,2)$. Observe that
sometimes the error drops sharply from iteration to iteration (e.g., from iteration 10
to iteration 11), sometimes the error decreases only slightly (from iteration 6 to 7),
and sometimes the error actually increases. It is therefore quite likely that the limit
which appears in the definition of order of convergence won't exist.

All is not lost, however. In Section 1.2, we saw that for a linearly convergent
sequence

$$|p_{n+1} - p| \approx \lambda |p_n - p|,$$

where $\lambda$ is the asymptotic error constant. As evidenced by the theoretical error
bound, the bisection method sequence does satisfy this relationship with
$\lambda = 1/2$. Furthermore, observe from Figure 2.1 that the overall relationship between
$\log |e_{n+1}|$ and $\log |e_n|$ appears to be linear with slope one. From this, it then follows
that the general trend between old and new errors is linear. We therefore stretch
the definition of order of convergence and say that the convergence of the bisection
method sequence is order $\alpha = 1$ with asymptotic error constant $\lambda = 1/2$.
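The erratic step-to-step behavior, and the overall trend of roughly one half, can both be seen by computing the ratios of successive errors. The following sketch (Python; the names and the use of a geometric mean as a summary of the "general trend" are assumptions made for illustration) uses the same test problem and the root value quoted in the text.

```python
import math

def f(x):
    return x**3 + 2*x**2 - 3*x - 1

p_exact = 1.1986912435       # root to ten decimal places, quoted in the text

a, b = 1.0, 2.0
errors = []
for n in range(15):
    p = (a + b) / 2
    errors.append(abs(p - p_exact))
    if f(a) * f(p) < 0:
        b = p
    else:
        a = p

# Ratio of new error to old error for each consecutive pair of iterations.
ratios = [errors[n + 1] / errors[n] for n in range(len(errors) - 1)]
for n, ratio in enumerate(ratios, start=1):
    print(f"|e_{n + 1}| / |e_{n}| = {ratio:.4f}")

# Geometric mean of the ratios: one way to summarize the average reduction per step.
geometric_mean = math.exp(sum(math.log(r) for r in ratios) / len(ratios))
print(f"geometric mean of the ratios: {geometric_mean:.3f}")
```

Individual ratios range from well below 0.1 to above 1, so the limit in the definition of order of convergence is not approached, while the average reduction per step stays close to $1/2$.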


 n   Enclosing Interval     Approximation   Absolute Error
 1   (1.000000,2.000000)    1.500000        0.3013087565
 2   (1.000000,1.500000)    1.250000        0.0513087565
 3   (1.000000,1.250000)    1.125000        0.0736912435
 4   (1.125000,1.250000)    1.187500        0.0111912435
 5   (1.187500,1.250000)    1.218750        0.0200587565
 6   (1.187500,1.218750)    1.203125        0.0044337565
 7   (1.187500,1.203125)    1.195312        0.0033787435
 8   (1.195312,1.203125)    1.199219        0.0005275065
 9   (1.195312,1.199219)    1.197266        0.0014256185
10   (1.197266,1.199219)    1.198242        0.0004490560
11   (1.198242,1.199219)    1.198730        0.0000392252
12   (1.198242,1.198730)    1.198486        0.0002049154
13   (1.198486,1.198730)    1.198608        0.0000828451
14   (1.198608,1.198730)    1.198669        0.0000218099
15   (1.198669,1.198730)    1.198700        0.0000087077

TABLE 2.1: Fifteen Iterations of Bisection Method Applied to $f(x) = x^3 + 2x^2 - 3x - 1$ Starting
from the Interval $(1, 2)$


Figure 2.1 Error after $n + 1$ iterations versus error after $n$ iterations
for approximations generated by the bisection method when applied to
the function $f(x) = x^3 + 2x^2 - 3x - 1$. A log-log scale has been used to
accommodate the variation in the order of magnitude of the errors.


Stopping Condition and Algorithm


We are now in position to select a stopping condition. In what follows, let $\epsilon$ be
a specified convergence tolerance. For any rootfinding technique, there are three
primary measures of convergence with which to construct the stopping condition.
These are


(1) the absolute error in the location of the root
    Terminate the iteration when $|p_n - p| < \epsilon$.

(2) the relative error in the location of the root
    Terminate the iteration when $|p_n - p| < \epsilon |p_n|$.

(3) the test for a root
    Terminate the iteration when $|f(p_n)| < \epsilon$.


Figure 2.2 (Left graph) The error conditions based on the location of
the root will provide more reliable termination of the iterative process
for this function. Simply requiring the value of the function to be small
could lead to large errors in the approximation of the location of the
root. (Right graph) The test for root condition will provide more reliable
termination of the iterative process for this function. Simply requiring
that $p_n$ be "close" to $p$ could lead to $f(p_n)$ being far from zero.


There is no general rule of thumb for selecting one stopping condition over another, 
and it is worth noting that none of these conditions works well in all cases. 


Consider the function plotted on the left in Figure 2.2, $f(x) = (x - 1)^7$, which
has a wide, flat plateau surrounding the root. Either of the stopping conditions
based on the location of the root will produce more reliable termination than the
test for root condition in the sense that a function value near zero will not guarantee
a small error in the approximate location of the root. In particular, even if
$f(x) \approx 10^{-7}$, we can only guarantee that $x \approx 1 + 10^{-1}$. A small error in the location of
the root, on the other hand, will always lead to a function value near zero. The
reliability of the stopping conditions is reversed for the function plotted on the right
in Figure 2.2, $f(x) = (x - 1)^{1/7}$, which has a nearly vertical portion surrounding
the root. For this function, a function value near zero guarantees that $x$ must be
close to 1, but having $x$ close to 1 does not imply that $f(x)$ will be close to 0. In
particular, if $x \approx 1 + 10^{-7}$, then $f(x) \approx 10^{-1}$.
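The two situations can be checked with direct evaluation. The sketch below (Python) assumes the two functions are $(x-1)^7$ and $(x-1)^{1/7}$, which is consistent with the magnitudes quoted above; the names `flat`, `steep`, `x_far`, and `x_near` are illustrative.

```python
def flat(x):
    """Wide, flat plateau around the root x = 1."""
    return (x - 1)**7

def steep(x):
    """Nearly vertical near the root x = 1 (evaluated here only for x >= 1)."""
    return (x - 1)**(1 / 7)

x_far = 1 + 1e-1     # a poor approximation to the root
x_near = 1 + 1e-7    # a very good approximation to the root

# Small function value, yet a large error in the location of the root.
print(f"flat:  |f(x)| = {abs(flat(x_far)):.1e}, |x - p| = {abs(x_far - 1):.1e}")

# Small error in the location of the root, yet a function value far from zero.
print(f"steep: |f(x)| = {abs(steep(x_near)):.1e}, |x - p| = {abs(x_near - 1):.1e}")
```

With a tolerance such as $10^{-6}$, the test for a root would accept `x_far` for the flat function, and the absolute error test would accept `x_near` for the steep function even though the function value there is about $0.1$.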
Throughout our development of rootfinding techniques we will implement a
stopping condition based on the absolute error in the location of the root. The
careful reader is now asking how we are going to accomplish this since we don't
know the value of $p$. In the case of the bisection method, though, we do have
some theory to fall back on. From the proof of the bisection method convergence
theorem, we know that

$$|p_n - p| \le \frac{b_n - a_n}{2}.$$

We can therefore terminate the bisection method when

$$\frac{b_n - a_n}{2} < \epsilon.$$

Bringing together the basic iterative structure we developed earlier and the
stopping condition we just selected, we can now construct an algorithm for the
bisection method.


GIVEN:   function whose zero is to be located, f
         left endpoint of interval, a
         right endpoint of interval, b
         convergence tolerance, ε
         maximum number of iterations, Nmax


STEP 1:  save sfa = sign( f(a) )
STEP 2:  for i from 1 to Nmax do
STEP 3:      p = a + (b - a)/2
STEP 4:      if ( (b - a) < 2ε ) then OUTPUT p
STEP 5:      save sfp = sign( f(p) )
STEP 6:      if ( sfa * sfp < 0 ) then
                 assign the value of p to b
             else
                 assign the value of p to a
                 assign the value of sfp to sfa
             end
         end
STEP 7:  OUTPUT a message that the maximum number
         of iterations has been exceeded prior to
         achieving convergence
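One possible translation of the algorithm into Python is sketched below. The function names `sign` and `bisection`, the default value of `nmax`, the returned iteration count, and the use of an exception for STEP 7 are all choices made for this illustration. The early return when $f(p)$ is exactly zero is an addition that does not appear in the pseudocode.

```python
import math

def sign(value):
    """Return -1, 0, or 1 according to the sign of value."""
    return (value > 0) - (value < 0)

def bisection(f, a, b, tol, nmax=100):
    """Approximate a zero of f on (a, b); returns (approximation, iterations used)."""
    sfa = sign(f(a))                      # STEP 1: one evaluation at the left endpoint
    for i in range(1, nmax + 1):          # STEP 2
        p = a + (b - a) / 2               # STEP 3: midpoint of the enclosing interval
        if (b - a) < 2 * tol:             # STEP 4: (b_n - a_n)/2 < tolerance
            return p, i
        sfp = sign(f(p))                  # STEP 5: the only new evaluation this pass
        if sfp == 0:                      # addition: p is an exact zero of f
            return p, i
        if sfa * sfp < 0:                 # STEP 6: the zero lies on (a, p)
            b = p
        else:                             # the zero lies on (p, b)
            a = p
            sfa = sfp
    # STEP 7: no convergence within nmax iterations
    raise RuntimeError("maximum number of iterations exceeded prior to convergence")

root, iterations = bisection(lambda x: x**3 + 2*x**2 - 3*x - 1, 1.0, 2.0, 5e-5)
print(root, iterations)

root2, iterations2 = bisection(lambda x: math.tan(math.pi * x) - x - 6, 0.40, 0.48, 5e-5)
print(root2, iterations2)
```

The two calls use the test problems and tolerance that appear in this section, so their results can be compared against the tabulated values.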


There are a few important remarks that need to be made regarding this algo- 
rithm. First, although we introduced the bisection method in terms of a sequence 
of enclosing intervals and a sequence of approximations to the location of the zero, 
a close examination of the method indicates that we only need to know the current 
enclosing interval and the current approximation to ensure proper execution. This 
is why the variables a, b, and p in the above algorithm are scalars and not arrays. 

Second, the standard measure for the amount of work performed by a rootfinding
technique is the number of times the function $f$ is evaluated, not the number of
iterations. In general, the function may be very complicated, and each evaluation
may require many floating point operations. We therefore want to avoid any
unnecessary function evaluations. This is the reason that the signs of $f(a)$ and $f(p)$
are saved in STEPS 1 and 5. Had these signs not been saved, we would have had to
re-evaluate the function at both $a$ and $p$ to perform the test in STEP 6. By saving
the signs, the algorithm, as written, requires only one new function evaluation per
iteration.

Finally, observe that in STEP 6, we worked with the signs of the function 
values, rather than checking the sign of the product f(a)f(p). By construction, 
both a and p will be converging toward a zero of f. Hence, both f(a) and f(p) 
will be approaching zero. Multiplying these values together could then result in 
underflow. 
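The underflow issue is easy to demonstrate in IEEE double precision. In the sketch below (Python), the two values `fa` and `fp` are hypothetical function values chosen only to be tiny and of opposite sign; they are not taken from any example in the text.

```python
import math

fa = 1e-200      # hypothetical tiny function value at a
fp = -1e-200     # hypothetical tiny function value at p, opposite in sign

# The true product, -1e-400, is smaller in magnitude than any double,
# so the computed product underflows to (negative) zero.
product = fa * fp
product_test = product < 0        # the sign change goes undetected

# Working with the signs alone avoids the problem.
sign_test = math.copysign(1.0, fa) * math.copysign(1.0, fp) < 0

print("product of values:", product, "-> detects sign change:", product_test)
print("product of signs detects sign change:", sign_test)
```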


EXAMPLE 2.2 A Second Demonstration Problem 


As we develop additional rootfinding techniques in subsequent sections, we will want
to have at least a couple of problems on hand with which to compare performance.
One of the problems we will use is locating the root of

$$x^3 + 2x^2 - 3x - 1 = 0$$

on the interval $(1,2)$. The values contained in the first three columns of Table 2.1
were obtained by applying the bisection method algorithm to this
problem with a convergence tolerance of $5 \times 10^{-5}$.

As a second example, we will use the equation

$$\tan(\pi x) - x - 6 = 0.$$

This equation actually has an infinite number of roots. Here, we want to
approximate the smallest positive root, which Figure 2.3 suggests lies on the
interval $(0.40, 0.48)$. Applying the bisection method algorithm to the function
$f(x) = \tan(\pi x) - x - 6$ with a convergence tolerance of $5 \times 10^{-5}$ produces the results


 n   Enclosing Interval     Approximation
 1   (0.400000,0.480000)    0.4400000000
 2   (0.440000,0.480000)    0.4600000000
 3   (0.440000,0.460000)    0.4500000000
 4   (0.450000,0.460000)    0.4550000000
 5   (0.450000,0.455000)    0.4525000000
 6   (0.450000,0.452500)    0.4512500000
 7   (0.450000,0.451250)    0.4506250000
 8   (0.450625,0.451250)    0.4509375000
 9   (0.450937,0.451250)    0.4510937500
10   (0.450937,0.451094)    0.4510156250
11   (0.451016,0.451094)    0.4510546875


To ten decimal places, $p = 0.4510472588$, so the absolute error in the final bisection
method approximation is roughly $7.426 \times 10^{-6}$.


Figure 2.3 The point of intersection between the graph of $y = x + 6$
and the graph of $y = \tan(\pi x)$ is the location of the smallest positive root
of $\tan(\pi x) - x - 6 = 0$.


An Application Problem: Saving for a Down Payment 


A couple plans to open a money market account in which they will save the down 
payment for the purchase of a home. They have $13,500 from the sale of some 
stock with which to open the account. After examining their budget, they feel they 
can comfortably deposit an extra $250 into the account each month. What is the 
minimum interest rate, compounded on a monthly basis, that the couple must earn 
on their investment to reach their goal of accumulating $25,000 within three years? 

To answer this question, we need to determine how money that earns compound
interest grows over time. Suppose that $P$ dollars are invested at an annual
interest rate $r$, compounded $m$ times per year. At the end of the first compounding
period, interest in the amount of $P\frac{r}{m}$ is credited to the account. The total value of
the investment is then

$$P + P\frac{r}{m} = P\left(1 + \frac{r}{m}\right).$$

The interest earned at the end of the next compounding period is
$P\left(1 + \frac{r}{m}\right)\frac{r}{m}$, so the value of the account grows to

$$P\left(1 + \frac{r}{m}\right) + P\left(1 + \frac{r}{m}\right)\frac{r}{m} = P\left(1 + \frac{r}{m}\right)\left(1 + \frac{r}{m}\right) = P\left(1 + \frac{r}{m}\right)^2.$$

After the third compounding period, the account value becomes

$$P\left(1 + \frac{r}{m}\right)^2 + P\left(1 + \frac{r}{m}\right)^2\frac{r}{m} = P\left(1 + \frac{r}{m}\right)^2\left(1 + \frac{r}{m}\right) = P\left(1 + \frac{r}{m}\right)^3$$

and, in general, after $n$ compounding periods, the value is

$$P\left(1 + \frac{r}{m}\right)^n.$$

This quantity is referred to as the future value of the investment.
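The closed-form future value can be checked against the period-by-period reasoning used to derive it. The following sketch (Python) does so for one set of inputs; the function names and the 5% rate are illustrative assumptions, not values from the text.

```python
def future_value(P, r, m, n):
    """Closed form: value of P after n compounding periods at annual rate r."""
    return P * (1 + r / m) ** n

def future_value_stepwise(P, r, m, n):
    """Credit interest of (current value) * r/m at the end of each period."""
    value = P
    for _ in range(n):
        value += value * r / m
    return value

closed = future_value(13500, 0.05, 12, 36)
stepwise = future_value_stepwise(13500, 0.05, 12, 36)
print(f"closed form: {closed:.2f}, stepwise: {stepwise:.2f}")
```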
We are now ready to return to the original problem. Let r denote the annual
interest rate paid by the money market account, which is compounded on a monthly
basis (i.e., m = 12). At the end of three years, the initial $13,500 investment will
have grown in value to

$$13500\left(1 + \frac{r}{12}\right)^{36}$$

dollars. As for the monthly deposits of $250, the total value of all 36 deposits will
be

$$250\left(1 + \frac{r}{12}\right)^{35} + 250\left(1 + \frac{r}{12}\right)^{34} + 250\left(1 + \frac{r}{12}\right)^{33} + \cdots + 250\left(1 + \frac{r}{12}\right) + 250. \qquad (1)$$

Here, we have used the fact that the first monthly deposit earns 35 months of
interest, the second earns 34 months of interest, the third earns 33 months of
interest, and so on. The sum of the geometric progression given by (1) may be
expressed in closed form as

$$250\,\frac{\left(1 + \frac{r}{12}\right)^{36} - 1}{r/12}.$$

If no other deposits are made to and no withdrawals are taken from the account,
the couple will therefore have saved

$$13500\left(1 + \frac{r}{12}\right)^{36} + 250\,\frac{\left(1 + \frac{r}{12}\right)^{36} - 1}{r/12}$$

dollars for the down payment by the end of three years. The minimum interest
rate that the money market account must pay for the couple to reach their goal is,
accordingly, the solution of the equation

$$13500\left(1 + \frac{r}{12}\right)^{36} + 250\,\frac{\left(1 + \frac{r}{12}\right)^{36} - 1}{r/12} = 25000.$$

Let's define

$$f(r) = 13500\left(1 + \frac{r}{12}\right)^{36} + 250\,\frac{\left(1 + \frac{r}{12}\right)^{36} - 1}{r/12} - 25000.$$

Note that $f(0.01) = -1956.54$, but $f(0.10) = 3645.91$, so the desired interest rate is
somewhere between 1% and 10%. Using the bisection method with a convergence
tolerance of $5 \times 10^{-6}$, we find, after 15 iterations,

$$r \approx 0.0439395.$$

The couple therefore needs to find an account paying roughly 4.40%, compounded
monthly.
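The calculation can be reproduced with a short program. The sketch below (Python) defines $f(r)$ as above and applies a compact version of the bisection algorithm from this section; the function names and the `nmax` default are illustrative choices.

```python
def f(r):
    """Savings after 36 months at annual rate r, minus the $25,000 goal."""
    growth = (1 + r / 12) ** 36
    return 13500 * growth + 250 * (growth - 1) / (r / 12) - 25000

def bisection(func, a, b, tol, nmax=100):
    """Bisection with saved signs; returns (approximation, iterations used)."""
    def sign(v):
        return (v > 0) - (v < 0)
    sfa = sign(func(a))
    for i in range(1, nmax + 1):
        p = a + (b - a) / 2
        if (b - a) < 2 * tol:        # stop when (b_n - a_n)/2 < tolerance
            return p, i
        sfp = sign(func(p))
        if sfa * sfp < 0:
            b = p
        else:
            a, sfa = p, sfp
    raise RuntimeError("maximum number of iterations exceeded")

print(f"f(0.01) = {f(0.01):.2f}, f(0.10) = {f(0.10):.2f}")
rate, iterations = bisection(f, 0.01, 0.10, 5e-6)
print(f"r = {rate:.7f} after {iterations} iterations")
```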


EXERCISES 


1. Verify that each of the following equations has a root on the interval $(0,1)$. Next,
perform the bisection method to determine $p_3$, the third approximation to the
location of the root, and to determine $(a_4, b_4)$, the next enclosing interval.

(a) In(1 +2) -cosz =0 (b) 2° +22-1=0 

(c) $e^{-x} - x = 0$ (d) $\cos x - x = 0$


In Exercises 2-5, verify that the given function has a zero on the indicated interval. 
Next, perform the first five (5) iterations of the bisection method and verify that each 
approximation satisfies the theoretical error bound of the bisection method, but that 
the actual errors do not steadily decrease. The exact location of the zero is indicated 
by the value of p. 


2. $f(x) = x^3 + x^2 - 3x - 3$, $(1,2)$, $p = \sqrt{3}$

3. $f(x) = \sin x$, $(3,4)$, $p = \pi$

4. $f(x) = 1 - \ln x$, $(2,3)$, $p = e$

5. f(z) =2° - 3, (1,2), p=V¥3


6. Determine a formula which relates the number of iterations, $n$, required by the
bisection method to converge to within an absolute error tolerance of $\epsilon$, starting
from the initial interval $(a, b)$.


7. Modify the algorithm for the bisection method as follows. Remove the input
Nmax, and calculate the number of iterations needed to achieve the specified
convergence tolerance using the results of Exercise 6.


8. Suppose that an equation is known to have a root on the interval $(0,1)$. How
many iterations of the bisection method are needed to achieve full machine
precision in the approximation to the location of the root assuming calculations are
performed in IEEE standard double precision? What if the root were known
to be contained in the interval $(8,9)$? (Hint: Consider the number of base 2
digits already known in the location of the root and how many base 2 digits are
available in the indicated floating point system.)


9. By construction, the endpoints of the enclosing intervals produced by the
bisection method satisfy $a_1 \le a_2 \le a_3 \le \cdots \le b_3 \le b_2 \le b_1$. Prove that the
sequences $\{a_n\}$ and $\{b_n\}$ converge and that

$$\lim_{n \to \infty} a_n = \lim_{n \to \infty} b_n = \lim_{n \to \infty} p_n = p.$$

10. It was noted that the function $f(x) = x^3 + 2x^2 - 3x - 1$ has a zero on the interval
$(-3, -2)$ and another on the interval $(-1, 0)$. Approximate both of these zeroes
to within an absolute tolerance of $5 \times 10^{-5}$.

11. Approximate /13 to three decimal places by applying the bisection method to
the equation z® ~ 13 = 0.

12. Approximate $1/37$ to five decimal places by applying the bisection method to
the equation $1/x - 37 = 0$.

13. In one of the worked examples of this section, the smallest positive root of the
equation $\tan(\pi x) - x - 6 = 0$ was approximated. Graphically determine an
interval which contains the next smallest positive root of this equation, and then
approximate the root to within an absolute tolerance of $5 \times 10^{-5}$.

14. The equation $(x - 0.5)(x + 1)^3(x - 2) = 0$ clearly has roots at $x = -1$,
$x = 0.5$, and $x = 2$. Each of the intervals listed below encompasses all of these
roots. Determine to which root the bisection method converges when each of the
intervals below is used as the starting interval.


(a) $(-3, 3)$ (b) $(-1.5, 3)$
(c) $(-2, 4)$ (d) $(-2, 3)$
(e) $(-1.5, 2.2)$ (f) $(-7, 3)$
15. It can be shown that the equation

3 


ia —6- 5 sin(22) =0 


has a unique real root. 

(a) Find an interval on which this unique real root is guaranteed to exist. 

(b) Using the interval found in part (a) and the bisection method, approximate 
the root to within an absolute tolerance of 107°. 

16. For each of the functions given below, use the bisection method to approximate

all real zeros. Use an absolute tolerance of 107° as a stopping criterion. 

(a) f(a) =e" +2? -2-4 

(b) f(x) =25 —2?-102 +7 

(c) f(z) = 1.05 - 1.042 + Ing 

17. Peters (“Optimum Spring-Damper Design for Mass Impact,” SIAM Review,

39(1), pp. 118-122, 1997) models the impact of an object on a spring-damper 

system. If the displacement of the object following impact, is limited, then the 

maximum force exerted on the object is minimized when the nondimensional 

damping coefficient, ¢, is the solution of the equation 


cos cv =| 2274 pa? 2c 


on the interval 0 < ¢ < 1/2. The maximum (nondimensional) force is then given 
by 5 
Fim = exp [-¢(ry + Tn) , 


Tf = cos} C/f1—c 


is the time of the end of the stroke and 
Tm = cos" [es - 4¢?)| /V1-¢@ 


is the time when the maximum force occurs. Determine ¢ to within an absolute 
tolerance of 5 x 10%, and then calculate ry, Tm and Fim. 

18. DeSantis, Gironi, and Marelli (“Vapor-Liquid Equilibrium from a Hard-Sphere
Equation of State,” Industrial and Engineering Chemistry Fundamentals, 15, 
182-189, 1976) derive a relationship for the compressibility factor of real gases 
of the form 


where 


need Le 

ar 
where $y$ is related to the van der Waals volume correction factor. If $z = 0.892$,
what is the value of $y$?

19. Reconsider the "Saving for a Down Payment" application problem. Which of
the following scenarios requires a smaller compounded monthly interest rate to
achieve a goal of $25,000 after three years:

(a) a $14,000 initial investment with $250 per month thereafter; or

(b) a $12,500 initial investment with $300 per month thereafter?


2.2. THE METHOD OF FALSE POSITION 


In Section 2.1, we developed the bisection method for approximating the zeros of 
continuous functions. On the plus side, the bisection method is straightforward 
to implement. In terms of computational cost, only one new function evaluation 
is needed per iteration, so the method is as inexpensive as one can expect. Most 
important, the sequence of approximations generated by the method is guaranteed 
to converge. 

On the minus side, the sequence of approximations generated by the bisection
method converges only linearly, with a rate of convergence of $O(1/2^n)$.
Furthermore, even though there is a theoretical bound available for the error in each
approximation, the bound can be overly pessimistic. As a result, it is possible that
an approximation which is accurate to within the specified convergence tolerance
will fail to terminate the iteration.

In this section we will develop a second simple enclosure method, one which is 
known as the method of false position. We will show that the sequence of approxi- 
mations obtained from this new method is still guaranteed to converge and that the 
convergence of the sequence is still only linear. However, for the method of false 
position, we will be able to compute a fairly accurate estimate for the error in each 
approximation, not just a theoretical error bound. This error estimate will allow 
the formulation of a stopping condition which should greatly reduce the possibility 
that a sufficiently accurate approximation will fail to terminate the iteration. 


Method of False Position 


Being a simple enclosure method, the method of false position iteratively determines
a sequence of root enclosing intervals, $(a_n, b_n)$, and a sequence of approximations,
which we shall denote by $p_n$. During each iteration, a single point is selected
from $(a_n, b_n)$ to approximate the location of the root and serve as $p_n$. If $p_n$ is
an accurate enough approximation, the iterative process is terminated. Otherwise,
the Intermediate Value Theorem is used to determine whether the root lies on
the subinterval $(a_n, p_n)$ or on the subinterval $(p_n, b_n)$. The entire process is then
repeated on that subinterval.

The method of false position differs from other simple enclosure methods in
the procedure used to select $p_n$. Whereas the bisection method simply chooses
the midpoint of the enclosing interval, the method of false position uses the
$x$-intercept of the line which passes through the points $(a_n, f(a_n))$ and $(b_n, f(b_n))$
as $p_n$ (see Figure 2.4).

Figure 2.4 Schematic for the selection of $p_n$ for the method of false
position.

The equation of the line which passes through $(a_n, f(a_n))$
and $(b_n, f(b_n))$ is given by

$$y - f(b_n) = \frac{f(b_n) - f(a_n)}{b_n - a_n}(x - b_n).$$

Setting $y = 0$ and then solving for $x = p_n$ yields the formula

$$p_n = b_n - f(b_n)\frac{b_n - a_n}{f(b_n) - f(a_n)}.$$


EXAMPLE 2.3 Method of False Position in Action 


We know that the function $f(x) = x^3 + 2x^2 - 3x - 1$ has a simple zero on the
interval $(1,2)$. Let's run through a few iterations of the method of false position to
demonstrate the general procedure.

For the first iteration, we have $(a_1, b_1) = (1,2)$ and we know that
$f(a_1) = -1 < 0$ and that $f(b_1) = 9 > 0$. Our first approximation to the location of the zero
is then

$$p_1 = b_1 - f(b_1)\frac{b_1 - a_1}{f(b_1) - f(a_1)} = 2 - 9\,\frac{2 - 1}{9 - (-1)} = 1.1.$$

To determine whether the zero is contained on $(a_1, p_1) = (1, 1.1)$ or on
$(p_1, b_1) = (1.1, 2)$, we calculate

$$f(p_1) = -0.549 < 0.$$

Since $f(a_1)$ and $f(p_1)$ are of the same sign, the Intermediate Value Theorem tells
us the zero is between $p_1$ and $b_1$. For the next iteration, we therefore take
$(a_2, b_2) = (p_1, b_1) = (1.1, 2)$.

Our second approximation to the location of the zero is

$$p_2 = 2 - 9\,\frac{2 - 1.1}{9 - (-0.549)} = 1.151743638.$$

Note that

$$f(p_2) \approx -0.274 < 0,$$

which is of the same sign as $f(a_2)$. Hence, the Intermediate Value Theorem tells us
the zero is now between $p_2$ and $b_2$, so we take $(a_3, b_3) = (p_2, b_2) = (1.151743638, 2)$.
In the third iteration, we calculate

$$p_3 = b_3 - f(b_3)\frac{b_3 - a_3}{f(b_3) - f(a_3)} = 1.17684091$$

and

$$f(p_3) \approx -0.131 < 0.$$

Hence, we once again find that $f(a_3)$ and $f(p_3)$ are of the same sign, which implies
that the zero must lie somewhere between $p_3$ and $b_3$. For the fourth iteration, we
will therefore have $(a_4, b_4) = (p_3, b_3) = (1.17684091, 2)$. Recall that to ten decimal
places, $p = 1.1986912435$, so the absolute error in $p_3$ is $2.185 \times 10^{-2}$.
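The three iterations above can be traced with a short script. This is a minimal sketch in Python; the variable names are illustrative, and the update rule is the $x$-intercept formula $p_n = b_n - f(b_n)(b_n - a_n)/(f(b_n) - f(a_n))$ used in the example.

```python
def f(x):
    return x**3 + 2*x**2 - 3*x - 1

a, b = 1.0, 2.0
fa, fb = f(a), f(b)          # f(a_1) = -1 and f(b_1) = 9
approximations = []          # will hold p_1, p_2, p_3

for n in range(1, 4):
    p = b - fb * (b - a) / (fb - fa)   # x-intercept of the secant line
    fp = f(p)
    approximations.append(p)
    print(f"n = {n}: (a_n, b_n) = ({a:.9f}, {b:.9f}), p_n = {p:.9f}, f(p_n) = {fp:.3f}")
    if fa * fp < 0:          # zero lies on (a_n, p_n)
        b, fb = p, fp
    else:                    # zero lies on (p_n, b_n)
        a, fa = p, fp

p_exact = 1.1986912435       # root to ten decimal places, quoted in the text
print(f"absolute error in p_3: {abs(approximations[-1] - p_exact):.3e}")
```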


Having seen how the method of false position works, the next issue to address
is the performance of this new method relative to that of the bisection method.
Recall that when comparing the performance of rootfinding techniques, the
fundamental measure of work is the number of function evaluations. For the bisection
method, we have seen that $n$ iterations require $n$ function evaluations. For the
method of false position, note that four function evaluations, $f(a_1)$, $f(b_1)$, $f(p_1)$
and $f(p_2)$, were needed in the preceding example to obtain $p_3$. In particular, $f(a_1)$
and $f(b_1)$ were used to calculate $p_1$. Each additional iteration then required one
new function value: $f(p_1)$ for the second iteration and $f(p_2)$ for the third. Thus, a
properly constructed false position algorithm, one that saves function values that
will be needed for subsequent iterations, will cost $n + 1$ function evaluations for $n$
iterations.
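The evaluation count can be confirmed by instrumenting a false position routine. The sketch below (Python) is one way to do this; the function name `false_position`, the counting wrapper, and the fixed iteration count in place of a stopping condition are assumptions made for illustration, since the stopping condition for this method has not yet been developed.

```python
def false_position(f, a, b, iterations):
    """Run a fixed number of false position iterations.

    Returns (p_n, number of function evaluations used).
    """
    evaluations = 0

    def counted_f(x):
        nonlocal evaluations
        evaluations += 1
        return f(x)

    fa, fb = counted_f(a), counted_f(b)      # two evaluations before the loop
    p = None
    for k in range(iterations):
        p = b - fb * (b - a) / (fb - fa)
        if k == iterations - 1:              # last p_n: no further evaluation needed
            break
        fp = counted_f(p)                    # one new evaluation per additional iteration
        if fp == 0:                          # p is an exact zero
            break
        if fa * fp < 0:
            b, fb = p, fp                    # saved value is reused next pass
        else:
            a, fa = p, fp
    return p, evaluations

def f(x):
    return x**3 + 2*x**2 - 3*x - 1

p3, evals3 = false_position(f, 1.0, 2.0, 3)
p10, evals10 = false_position(f, 1.0, 2.0, 10)
print(p3, evals3)
print(p10, evals10)
```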

Returning to the preceding example, it should be clear that the appropriate
comparison to make is between the approximation $p_3$ from the method of
false position and the approximation $p_4$ from the bisection method. With
$f(x) = x^3 + 2x^2 - 3x - 1$ and an initial interval of $(1,2)$, the bisection method produces
$p_4 = 1.1875$ (see the first example in Section 2.1). The error in this approximation
is roughly $1.119 \times 10^{-2}$, which is about half the error in the value $p_3$ obtained above.
On the other hand, with one more iteration each, the error from the bisection
approximation is roughly $2.006 \times 10^{-2}$, which is nearly twice the false position error
of $1.006 \times 10^{-2}$. Hence, for this problem, the two methods perform equally well.

There are, however, problems for which bisection clearly outperforms false
position. As a case in point, consider the function $f(x) = \tan(\pi x) - x - 6$, which
has a zero on $(0.4, 0.48)$. To guarantee an absolute error of less than $5 \times 10^{-5}$,
the bisection method uses 11 function evaluations. With these 11 evaluations, the
actual error is approximately $7.43 \times 10^{-6}$. The method of false position needs 15
function evaluations to produce an error less than $5 \times 10^{-5}$ and 18 evaluations to
achieve an error of roughly $7.43 \times 10^{-6}$.

As one might expect, there are other problems for which false position
outperforms bisection. As an example, consider $f(x) = x^3 + 2x^2 - 3x - 1$, which has a
zero on $(-3, -2)$. To guarantee an absolute error of less than $5 \times 10^{-5}$, the bisection
method uses 15 function evaluations, and the actual error is roughly $2.83 \times 10^{-5}$.
The method of false position achieves both of these error levels with only 5 function
evaluations.

So what should we make from all of this? From the outset, we might have 
expected false position to always outperform bisection. After all, false position uses 
more information about the function. However, our examples have shown that this 
is not the case. There is also no general theory to indicate which method will be 
better for a given problem. The main advantage which the method of false position 
has over the bisection method is the existence of a computable error estimate. We 
will derive this error estimate toward the end of the section when we discuss an 
appropriate stopping condition. In addition to providing more reliable termination, 
the existence of an error estimate will allow us, in Section 2.6, to accelerate the 
convergence of the false position sequence. Such acceleration is not possible for the 
bisection method. 


Convergence Analysis 


Does the sequence of approximations generated by the method of false position,
$\{p_n\}$, converge to a root $p$? If so, what is the order of convergence? To answer
these questions, we need to examine the associated sequence of errors, $\{e_n\}$, where
$e_n = p_n - p$. The sequence $\{p_n\}$ converges if and only if $|e_n| \to 0$ as $n \to \infty$, and
the order of convergence is determined by the asymptotic relationship between $|e_n|$
and $|e_{n-1}|$.

The error sequence $\{e_n\}$ is governed by what is known as the error evolution
equation. To construct the error evolution equation, take the equation for $p_n$ and
subtract $p$ from both sides. This yields

$$p_n - p = (b_n - p) - f(b_n)\frac{b_n - a_n}{f(b_n) - f(a_n)}. \qquad (1)$$

Next, approximate the function values $f(a_n)$ and $f(b_n)$ by the second degree Taylor
polynomials

$$f(a_n) \approx f'(p)(a_n - p) + \frac{f''(p)}{2}(a_n - p)^2$$

$$f(b_n) \approx f'(p)(b_n - p) + \frac{f''(p)}{2}(b_n - p)^2, \qquad (2)$$

where the fact that $f(p) = 0$ has been taken into account. The term $f(b_n) - f(a_n)$
is then approximately

$$f(b_n) - f(a_n) \approx f'(p)(b_n - a_n) + \frac{f''(p)}{2}\left[(b_n - p)^2 - (a_n - p)^2\right]$$

$$= (b_n - a_n)\left[f'(p) + \frac{f''(p)}{2}(b_n + a_n - 2p)\right]. \qquad (3)$$

Substituting (2) and (3) into (1), factoring the term $b_n - p$, and dividing out the
term $b_n - a_n$, yields

$$p_n - p \approx (b_n - p)\left[1 - \frac{f'(p) + \frac{f''(p)}{2}(b_n - p)}{f'(p) + \frac{f''(p)}{2}(b_n + a_n - 2p)}\right]$$

$$= (b_n - p)(a_n - p)\frac{f''(p)}{2f'(p) + f''(p)(b_n + a_n - 2p)}. \qquad (4)$$


Before we can proceed any further, we have to make one very important 
observation. Table 2.2 displays the results of ten iterations of the method of false 
position applied to three different test problems. Focus on the middle column, which 
lists the enclosing interval for each iteration. Notice that in each case one of the 
endpoints remains fixed, while the other endpoint is just the previous approximation 
to the location of the root. Now, not every problem will have one endpoint fixed 
for all iterations; however, when using the method of false position, as the iteration 
proceeds, one of the endpoints will always eventually become fixed. In fact, the 
method of false position will always eventually settle into one of the configurations 
shown in Figure 2.5. 

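Let’s now return to the error evolution equation. Because one endpoint of the
enclosing interval remains fixed while the other is just $p_{n-1}$, the factor in (4)
belonging to the moving endpoint is $e_{n-1}$. Dropping the $e_{n-1}$ contribution to the
denominator, which tends to zero, (4) becomes

$$e_n \approx \lambda e_{n-1},$$

where

$$\lambda = \frac{l f''(p)}{2f'(p) + l f''(p)}$$

and

$$l = \begin{cases} a_n - p, & a_n \text{ remains fixed} \\ b_n - p, & b_n \text{ remains fixed.} \end{cases}$$

Provided $|\lambda| < 1$, it then follows that the sequence generated by the method of false
position converges, and the convergence is linear with asymptotic error constant $|\lambda|$.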
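The linear convergence claim can be checked numerically. The following Python sketch is not part of the original text; it applies the method of false position to the test problem $f(x) = x^3 + 2x^2 - 3x - 1$ on $(1, 2)$ and compares the observed ratios $e_n/e_{n-1}$ with the constant predicted by the formula above. The helper name `false_position_sequence` and the reference root value (read off from the tabulated results) are assumptions made for illustration.

```python
def false_position_sequence(f, a, b, n_iter):
    """Return [p_1, ..., p_n_iter] from the method of false position on (a, b)."""
    fa, fb = f(a), f(b)
    approximations = []
    for _ in range(n_iter):
        p = b - fb * (b - a) / (fb - fa)   # x-intercept of the secant line
        fp = f(p)
        approximations.append(p)
        if fa * fp < 0:                    # sign change on (a, p): replace b
            b, fb = p, fp
        else:                              # otherwise replace a
            a, fa = p, fp
    return approximations


def f(x):
    return x**3 + 2 * x**2 - 3 * x - 1


root = 1.1986912435                        # assumed reference value of p
ps = false_position_sequence(f, 1.0, 2.0, 10)
errors = [p - root for p in ps]
ratios = [errors[n] / errors[n - 1] for n in range(1, len(errors))]

# Predicted constant: here b_n = 2 stays fixed, so l = b_n - p.
l = 2.0 - root
d1 = 3 * root**2 + 4 * root - 3            # f'(p)
d2 = 6 * root + 4                          # f''(p)
lam_predicted = l * d2 / (2 * d1 + l * d2)

print("observed e_n / e_(n-1):", [round(r, 4) for r in ratios])
print("predicted lambda      :", round(lam_predicted, 4))
```

The observed ratios should settle near 0.456, while the formula gives roughly 0.423. The agreement is only approximate because $\lambda$ was derived from second degree Taylor polynomials and $l \approx 0.8$ is not small for this problem; the qualitative conclusion (a constant ratio below 1, hence linear convergence) is what the experiment supports.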

The only question that remains is whether $|\lambda|$ really is less than 1. Remember
that the method of false position eventually settles into one of the configurations
shown in Figure 2.5. Let’s consider the configuration in the upper left corner in
more detail. Because $a_n$ is fixed, $l = a_n - p$. Now, $a_n - p < 0$ and $f''(p) < 0$,
so $(a_n - p)f''(p) > 0$. Since $f'(p)$ is also greater than zero, it follows that
$2f'(p) + (a_n - p)f''(p) > (a_n - p)f''(p)$ and

$$0 < \frac{(a_n - p)f''(p)}{2f'(p) + (a_n - p)f''(p)} = \lambda < 1.$$

Hence $|\lambda| < 1$ and the method of false position converges. In a similar fashion, the
remaining three configurations can be shown to lead to $|\lambda| < 1$. The details are left
as an exercise.


$f(x) = x^3 + 2x^2 - 3x - 1$ on $(1, 2)$

 n    Enclosing Interval                 Approximation
 1    (1.0000000000, 2.0000000000)       1.1000000000
 2    (1.1000000000, 2.0000000000)       1.1517436381
 3    (1.1517436381, 2.0000000000)       1.1768409100
 4    (1.1768409100, 2.0000000000)       1.1886276733
 5    (1.1886276733, 2.0000000000)       1.1940789113
 6    (1.1940789113, 2.0000000000)       1.1965820882
 7    (1.1965820882, 2.0000000000)       1.1977277544
 8    (1.1977277544, 2.0000000000)       1.1982513178
 9    (1.1982513178, 2.0000000000)       1.1984904185
10    (1.1984904185, 2.0000000000)       1.1985995764

$f(x) = \tan(\pi x) - x - 6$ on $(0.4, 0.48)$

 n    Enclosing Interval                 Approximation
 1    (0.4000000000, 0.4800000000)       0.4208674108
 2    (0.4208674108, 0.4800000000)       0.4332027501
 3    (0.4332027501, 0.4800000000)       0.4404957388
 4    (0.4404957388, 0.4800000000)       0.4448079249
 5    (0.4448079249, 0.4800000000)       0.4473577484
 6    (0.4473577484, 0.4800000000)       0.4488655162
 7    (0.4488655162, 0.4800000000)       0.4497571072
 8    (0.4497571072, 0.4800000000)       0.4502843380
 9    (0.4502843380, 0.4800000000)       0.4505961108
10    (0.4505961108, 0.4800000000)       0.4507804752

$f(x) = x^3 + 2x^2 - 3x - 1$ on $(-3, -2)$

 n    Enclosing Interval                 Approximation
 1    (-3.0000000000, -2.0000000000)     -2.8333333333
 2    (-3.0000000000, -2.8333333333)     -2.9079283887
 3    (-3.0000000000, -2.9079283887)     -2.9120026293
 4    (-3.0000000000, -2.9120026293)     -2.9122172667
 5    (-3.0000000000, -2.9122172667)     -2.9122285522
 6    (-3.0000000000, -2.9122285522)     -2.9122291456
 7    (-3.0000000000, -2.9122291456)     -2.9122291768
 8    (-3.0000000000, -2.9122291768)     -2.9122291784
 9    (-3.0000000000, -2.9122291784)     -2.9122291785
10    (-3.0000000000, -2.9122291785)     -2.9122291785

TABLE 2.2: Ten Iterations of the Method of False Position Applied to Three Test Problems


Figure 2.5 Eventual configurations for the method of false position. (In each
configuration, one endpoint, $a_n$ or $b_n$, is fixed and the other is $p_{n-1}$.)


Stopping Condition 


We would like to terminate the method of false position when $|e_n|$ falls below a
specified convergence tolerance, $\epsilon$. To implement this idea, we must have a formula
for estimating $|e_n|$ which involves only quantities which can be calculated during
the course of the iteration. As a starting point, note that

$$\begin{aligned}
e_n &= p_n - p \\
&= p_n - p_{n-1} + p_{n-1} - p \\
&= p_n - p_{n-1} + e_{n-1}.
\end{aligned} \tag{5}$$

From the error evolution equation, $e_n \approx \lambda e_{n-1}$, or, equivalently, $e_{n-1} \approx e_n/\lambda$.
Substituting this expression into (5), solving for $e_n$, and taking absolute values
yields

$$|e_n| \approx \left|\frac{\lambda}{\lambda - 1}\right| |p_n - p_{n-1}|. \tag{6}$$
Next, we focus on $\lambda$. The value of $\lambda$ can be estimated using terms from the
sequence $\{p_n\}$ as follows. Consider the ratio

$$\frac{p_n - p_{n-1}}{p_{n-1} - p_{n-2}} = \frac{(p_n - p) - (p_{n-1} - p)}{(p_{n-1} - p) - (p_{n-2} - p)} = \frac{e_n - e_{n-1}}{e_{n-1} - e_{n-2}}.$$

Using the relations $e_n \approx \lambda e_{n-1}$ and $e_{n-2} \approx e_{n-1}/\lambda$, we find

$$\frac{p_n - p_{n-1}}{p_{n-1} - p_{n-2}} \approx \frac{(\lambda - 1)e_{n-1}}{\left(1 - \frac{1}{\lambda}\right)e_{n-1}} = \lambda. \tag{7}$$

Equations (6) and (7) together constitute a computable estimate for $|e_n|$. The
accuracy of this estimate is demonstrated in Table 2.3. Consequently, an appropriate
stopping condition for the method of false position is to terminate the iteration
when

$$\left|\frac{\lambda}{\lambda - 1}\right| |p_n - p_{n-1}| < \epsilon,$$

where $\lambda$ is obtained from (7).
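One way the stopping condition might be implemented is sketched below in Python. This is an illustration under stated assumptions rather than a definitive algorithm: the function name `false_position`, its return values, and the `max_iter` safeguard are choices made here, and the error estimate only becomes available from the third approximation onward, since (7) needs three consecutive terms of the sequence.

```python
def false_position(f, a, b, tol, max_iter=100):
    """Method of false position, stopped by the error estimate (6)-(7).

    Returns (approximation, error estimate, iterations used).
    """
    fa, fb = f(a), f(b)                    # saved function values, reused later
    if fa * fb > 0:
        raise ValueError("f(a) and f(b) must have opposite signs")
    history = []                           # p_1, p_2, ...
    for n in range(1, max_iter + 1):
        p = b - fb * (b - a) / (fb - fa)
        fp = f(p)
        history.append(p)
        if fp == 0.0:                      # landed exactly on a root
            return p, 0.0, n
        if len(history) >= 3:
            d1 = history[-1] - history[-2]             # p_n - p_(n-1)
            d2 = history[-2] - history[-3]             # p_(n-1) - p_(n-2)
            if d2 != 0.0 and d1 != d2:
                lam = d1 / d2                          # equation (7)
                estimate = abs(lam / (lam - 1.0)) * abs(d1)   # equation (6)
                if estimate < tol:
                    return p, estimate, n
        if fa * fp < 0:                    # sign change on (a, p): replace b
            b, fb = p, fp
        else:                              # otherwise replace a
            a, fa = p, fp
    raise RuntimeError("tolerance not reached within max_iter iterations")


def f(x):
    return x**3 + 2 * x**2 - 3 * x - 1


p, estimate, n = false_position(f, 1.0, 2.0, tol=1e-4)
print(n, p, estimate)
```

For this test problem and a tolerance of $10^{-4}$, the sketch should stop at the tenth approximation, since the tabulated error estimate first drops below $10^{-4}$ at $n = 10$. Only one new function evaluation is made per iteration because $f(a_n)$ and $f(b_n)$ are carried along.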


An Application Problem: Depth of Submersion 


In the Chapter 2 Overview (see page 55), we found that a spherical object of
radius $R$ and density $\rho_o$ placed on the surface of a fluid of density $\rho_f$ would sink to
a depth $h$ which is a root of the equation

$$\frac{\rho_f}{3}h^3 - R\rho_f h^2 + \frac{4}{3}R^3\rho_o = 0.$$

In deriving this equation it was assumed that the object was not fully submerged
in the fluid.

Suppose we place a spherical ball of cork with a radius of $R = 5$ cm and a
density of $\rho_o = 0.120$ g/cm$^3$ into motor oil with a density of $\rho_f = 0.890$ g/cm$^3$.
Five iterations of the method of false position with an initial interval of $(0, 10)$ and
a convergence tolerance of $\epsilon = 5 \times 10^{-5}$ yield $p_5 = 2.3043353119$. The estimate for
the error in $p_5$ is roughly $4.378 \times 10^{-5}$. Thus, the ball of cork sinks to a depth of
roughly 2.304 cm in the motor oil.
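The calculation can be reproduced with a few lines of Python. The sketch below assumes that the submersion equation has the form $\frac{\rho_f}{3}h^3 - R\rho_f h^2 + \frac{4}{3}R^3\rho_o = 0$ (the balance between the weight of the sphere and the weight of the fluid displaced by a spherical cap of height $h$); the names `submersion` and `false_position_sequence` are introduced here for illustration only.

```python
R = 5.0          # radius of the cork ball, cm
rho_o = 0.120    # density of cork, g/cm^3
rho_f = 0.890    # density of motor oil, g/cm^3


def submersion(h):
    """Assumed form of the depth-of-submersion equation (zero at the depth h)."""
    return rho_f / 3 * h**3 - R * rho_f * h**2 + 4 / 3 * R**3 * rho_o


def false_position_sequence(f, a, b, n_iter):
    """Return [p_1, ..., p_n_iter] from the method of false position on (a, b)."""
    fa, fb = f(a), f(b)
    approximations = []
    for _ in range(n_iter):
        p = b - fb * (b - a) / (fb - fa)
        fp = f(p)
        approximations.append(p)
        if fa * fp < 0:
            b, fb = p, fp
        else:
            a, fa = p, fp
    return approximations


depths = false_position_sequence(submersion, 0.0, 10.0, 30)
print("p_5           =", round(depths[4], 10))
print("converged h   =", round(depths[-1], 7))
print("residual at h =", submersion(depths[-1]))
```

Running more iterations than the five used in the text simply confirms that the sequence settles at a depth of about 2.304 cm.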


EXERCISES 
1. Each of the following equations has a root on the interval $(0, 1)$. Perform the
method of false position to determine $p_3$, the third approximation to the location
of the root, and to determine $(a_4, b_4)$, the next enclosing interval.
(a) $\ln(1 + x) - \cos x = 0$    (b) $x^3 + 2x - 1 = 0$
(c) $e^{-x} - x = 0$             (d) $\cos x - x = 0$

$f(x) = x^3 + 2x^2 - 3x - 1$ on $(1, 2)$

 n    Absolute Error, $|e_n|$    Error Estimate, $\left|\frac{\lambda}{\lambda - 1}\right| |p_n - p_{n-1}|$
 1    0.0986912435
 2    0.0469476054
 3    0.0218503335               0.0236382347
 4    0.0100635702               0.0104374516
 5    0.0046123322               0.0046903760
 6    0.0021091553               0.0021254290
 7    0.0009634891               0.0009668808
 8    0.0004399257               0.0004406324
 9    0.0002008251               0.0002009723
10    0.0000916671               0.0000916978

$f(x) = \tan(\pi x) - x - 6$ on $(0.4, 0.48)$

 n    Absolute Error, $|e_n|$    Error Estimate, $\left|\frac{\lambda}{\lambda - 1}\right| |p_n - p_{n-1}|$
 1    0.0301798480
 2    0.0178445088
 3    0.0105515200               0.0105481928
 4    0.0062393339               0.0062382353
 5    0.0036895105               0.0036891385
 6    0.0021817426               0.0021816149
 7    0.0012901516               0.0012901074
 8    0.0007629208               0.0007629055
 9    0.0004511480               0.0004511426
10    0.0002667836               0.0002667817

$f(x) = x^3 + 2x^2 - 3x - 1$ on $(-3, -2)$

 n    Absolute Error, $|e_n|$    Error Estimate, $\left|\frac{\lambda}{\lambda - 1}\right| |p_n - p_{n-1}|$
 1    7.8895845e-02
 2    4.3007897e-03
 3    2.2654917e-04              2.3538350e-04
 4    1.1911748e-05              1.1936259e-05
 5    6.2624802e-07              6.2631578e-07
 6    3.2924183e-08              3.2924370e-08
 7    1.7309461e-09              1.7309465e-09
 8    9.1002317e-11              9.1002226e-11
 9    4.7841731e-12              4.7843508e-12
10    2.5091040e-13              2.5153164e-13

TABLE 2.3: Confirmation of Error Estimate
2. Construct an algorithm for the method of false position. Remember to save
function values which will be needed for later iterations and to implement a
stopping condition based on equations (6) and (7).

3. Confirm that $|\lambda| < 1$ for the remaining configurations in Figure 2.5.

In Exercises 4-7, an equation, an interval on which the equation has a root, and the
exact value of the root are specified.

(1) Perform the first five (5) iterations of the method of false position.

(2) Verify that the absolute error in the third, fourth and fifth approximations
satisfies the error estimate

$$|e_n| \approx \left|\frac{\lambda}{\lambda - 1}\right| |p_n - p_{n-1}|.$$

(3) How does the error in the fifth false position approximation compare to the
maximum error which would result from six iterations of the bisection method?

4. The equation $x^3 + x^2 - 3x - 3 = 0$ has a root on the interval $(1, 2)$, namely
$x = \sqrt{3}$.

5. The equation $x^7 = 3$ has a root on the interval $(1, 2)$, namely $x = \sqrt[7]{3}$.

6. The equation $x^3 - 13 = 0$ has a root on the interval $(2, 3)$, namely $\sqrt[3]{13}$.

7. The equation $1/x - 37 = 0$ has a zero on the interval $(0.01, 0.1)$, namely $x = 1/37$.

8. The function $f(x) = \sin x$ has a zero on the interval $(3, 4)$, namely $x = \pi$.
Perform three iterations of the method of false position to approximate this zero.
Determine the absolute error in each of the three computed approximations.
What is the apparent order of convergence? What explanation can you provide
for this behavior?

9. (a) Verify that the equation $x^4 - 18x^2 + 45 = 0$ has a root on the interval $(1, 2)$.
Next, perform three iterations of the method of false position. Given that
the exact value of the root is $x = \sqrt{3}$, compute the absolute error in the three
approximations just obtained. What is the apparent order of convergence?
What explanation can you provide for this behavior?

(b) Verify that the equation $x^4 - 18x^2 + 45 = 0$ also has a root on the interval
$(3, 4)$. Perform five iterations of the method of false position, and compute
the absolute error in each approximation. The exact value of the root is
$x = \sqrt{15}$. What is the apparent order of convergence in this case?

(c) What explanation can you provide for the different convergence behavior
between parts (a) and (b)?

10. The function $f(x) = x^3 + 2x^2 - 3x - 1$ has a zero on the interval $(-1, 0)$.

Approximate this zero to within an absolute tolerance of 5 x 107 a 


11. For each of the functions given below, use the method of false position to
approximate all real roots. Use an absolute tolerance of 107° as a stopping condition.


(a) fiz}= "4 q%—2-4 
(b) f(x) =a - a? - 10r+7 
(c) $f(x) = 1.05 - 1.04x + \ln x$

12. In the literature, it is not uncommon to find the method of false position
terminated when $|p_n - p_{n-1}| < \epsilon$. Comment on the accuracy of this stopping
condition. Consider the cases $\lambda \approx 0$, $\lambda \approx 1/2$ and $\lambda \approx 1$.

13. A storage tank is in the shape of a horizontal cylinder with length $L$ and radius $r$.
The volume $V$ of fluid in the tank is related to the depth $h$ of the fluid by the
equation

$$V = \left[r^2\cos^{-1}\left(\frac{r - h}{r}\right) - (r - h)\sqrt{2rh - h^2}\right]L.$$

If $r = 1$ meter, $L = 3$ meters, and $V = 7$ cubic meters, determine $h$.

14. The equation $x^2 = 1 - \cos(\sqrt{2}x) + \sqrt{2}\sin(\sqrt{2}x)$ has two real roots. One of them
is at $x = 0$. Determine an interval that contains the other root, and then
approximate this root to three decimal places. This problem arises in the calculation of
the amplitude of the solution to a nonlinear third-order differential equation. See
Gottlieb (“Simple Nonlinear Jerk Functions with Periodic Solutions,” American
Journal of Physics, 66 (10), 903-906, 1998) for details.

15. Rework the “Depth of Submersion” problem to determine the depth to which a
glass marble of radius 2 cm and density 0.040 g/cm$^3$ sinks in water of density
0.998 g/cm$^3$.


2.3. FIXED POINT ITERATION SCHEMES 


In the previous two sections, we studied simple enclosure methods for solving 
rootfinding problems. We found that simple enclosure methods are guaranteed 
to converge to a root of the equation being studied. Unfortunately, the rate of con- 
vergence (i.e., the number of iterations required to achieve a given level of precision 
in the approximate root) tends to be slow. In this section, the concept of fixed point 
iteration schemes will be presented. When properly constructed, these schemes will 
exhibit rapid convergence. The price paid for this accelerated convergence, however, 
is the loss of guaranteed convergence. 


Review of Mean Value Theorem 


The theorems which we prove in this section will make extensive use of the Mean 
Value Theorem. Therefore, rather than jumping straight into a discussion of fixed 
points and fixed point iteration schemes, we'll start with a review of the statement 
and consequences of this important theorem. 


Theorem. If the function $f$ is continuous on the closed interval $[a, b]$ and
differentiable on the open interval $(a, b)$, then there exists a real number
$\xi \in (a, b)$ such that

$$f'(\xi) = \frac{f(b) - f(a)}{b - a}.$$


The Mean Value Theorem has an interesting geometric interpretation (see
Figure 2.6). The expression $(f(b) - f(a))/(b - a)$ gives the slope of the line which
passes through the points $(a, f(a))$ and $(b, f(b))$. The line through these two points
is often called a secant line. Of course, $f'(\xi)$ gives the slope of the line tangent to
the graph of $f$ at the location $x = \xi$. Hence, for a continuous and differentiable
function, the Mean Value Theorem guarantees there is at least one point on the
graph of the function at which the tangent line is parallel to the secant line.

Figure 2.6 Geometric interpretation of the Mean Value Theorem.

When we actually apply the Mean Value Theorem later in the section, we will
use a slightly different formulation. Taking the equation

$$f'(\xi) = \frac{f(b) - f(a)}{b - a}$$

and multiplying through by $b - a$ yields

$$f(b) - f(a) = f'(\xi)(b - a).$$


Note this latter equation relates the difference of two function values to the differ- 
ence of the corresponding input values. Throughout the remainder of this section, 
be on the lookout for expressions with this property. An application of the Mean 
Value Theorem is likely to follow soon after. 


EXAMPLE 2.4 An Inequality Involving the Sine Function 


Consider the inequality

$$|\sin b - \sin a| \le |b - a|,$$

where $a$ and $b$ are any real numbers. Here is one of those expressions alluded
to above. Note that the left side of the inequality involves the difference of two
function values and the right side involves the difference of the corresponding input
values. This is a clear indication that we should use the Mean Value Theorem.

So let $a$ and $b$ be any two real numbers. If $a = b$, then the inequality is
trivially satisfied. Suppose then that $a \ne b$. Because the sine function is both
continuous and differentiable everywhere, the function is certainly continuous and
differentiable on the interval between $x = a$ and $x = b$. Applying the Mean Value
Theorem to the function $f(x) = \sin x$, it follows that

$$\sin b - \sin a = \cos\xi\,(b - a),$$

for some $\xi$ between $a$ and $b$. Taking the absolute value of both sides of this last
expression and using the fact that the magnitude of the cosine function is always
less than or equal to one produces the desired result.


Background for Fixed Points 


Consider the function $\sin x$. Since the sine function maps $x = \pi/4$ to the value
$\sqrt{2}/2$, the sine function may be thought of as moving the input value of $\pi/4$ to the
output value of $\sqrt{2}/2$. On the other hand, the sine function maps zero to zero; i.e.,
$\sin 0 = 0$. In keeping with the analogy just established, the sine function fixes the
location of 0. For this reason $x = 0$ is said to be a fixed point of the function $\sin x$.
In general, the following definition is made:


Definition. A FIXED POINT of the function $g$ is any real number, $p$, for
which $g(p) = p$; that is, whose location is fixed by $g$.


This definition provides a direct analytical means for determining fixed points, 
which can be used in simple cases. 


EXAMPLE 2.5 Fixed Points of the Logistic Equation 


One of the most popular mathematical models for the generation-by-generation 
growth of a population is the logistic equation: 


$$p_{n+1} = c\,p_n(1 - p_n),$$

where $0 < c < 4$ is a constant and $p_n$ denotes the normalized size of the population
in the $n$-th generation, measured relative to the maximum population which the
environment can support. The fixed points of the function on the right-hand side
of the logistic equation play an important role in the dynamics of the long-term
behavior of the population. Using the definition, the fixed points for the logistic
equation are the solutions of

$$p = cp(1 - p).$$

Solving this quadratic equation produces $p = 0$ and $p = (c - 1)/c$.
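The two fixed points can be checked by direct substitution. The short Python sketch below does this for a few sample values of $c$; the function names and the sample values are chosen here purely for illustration.

```python
def logistic(p, c):
    """Right-hand side of the logistic equation, c * p * (1 - p)."""
    return c * p * (1.0 - p)


def logistic_fixed_points(c):
    """The fixed points p = 0 and p = (c - 1)/c (requires c != 0)."""
    return [0.0, (c - 1.0) / c]


residuals = []
for c in (0.5, 2.0, 3.2):
    for p in logistic_fixed_points(c):
        residual = logistic(p, c) - p      # zero exactly when p is a fixed point
        residuals.append(residual)
        print(f"c = {c}: p = {p:+.6f}, g(p) - p = {residual:+.1e}")
```

Note that for $c < 1$ the second fixed point is negative, so it has no meaning as a normalized population size even though it satisfies the definition.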
Figure 2.7 Graphical determination of the existence of a fixed point
for the function $g(x) = e^{-x}$.


For those functions for which the analytical approach fails, it is possible to
investigate the existence of fixed points using a graphical approach. Begin by
graphing $y_1 = g(x)$ and $y_2 = x$ on the same set of coordinate axes. At any point of
intersection between the two graphs, we are guaranteed that $y_1 = y_2$, and hence,
$g(x) = x$. Therefore, points of intersection between the graph of $g$ and the graph
of the identity function represent the fixed points of $g$. To illustrate this process,
consider $g(x) = e^{-x}$. The equation $e^{-x} = x$ cannot be solved by analytical means,
but Figure 2.7 indicates the existence of a unique fixed point somewhere in the
vicinity of 0.6.

We have so far encountered a function with two fixed points and a function
with a unique fixed point. There are also functions which do not have any fixed
points. The graphical approach described in the previous paragraph is sufficient
to establish that $e^x$ and $\ln x$ have no fixed points (their graphs never intersect the
graph of the identity function). Either the analytical approach or the graphical
approach can be used to establish that $x^2 + 1$ also has no fixed points. To conclude
the introductory material on fixed points, here is a theorem which states conditions
under which a function is guaranteed to have a unique fixed point.


Theorem. Let $g$ be continuous on the closed interval $[a, b]$ with $g : [a, b] \to
[a, b]$. Then $g$ has a fixed point $p \in [a, b]$. Furthermore, if $g$ is differentiable on
the open interval $(a, b)$ and there exists a positive constant $k < 1$ such that
$|g'(x)| \le k < 1$ for all $x \in (a, b)$, then the fixed point in $[a, b]$ is unique.

Proof. (1) Existence

Assume that $g$ is continuous on the closed interval $[a, b]$ with $g : [a, b] \to [a, b]$.
Define the auxiliary function $h(x) = g(x) - x$. Note that $h$ satisfies two
important properties. First, since $h$ is the difference of two functions that
are continuous on $[a, b]$, $h$ is also continuous on that interval. Second, by
construction, the roots of $h$ are precisely the fixed points of $g$.

Now, since $\min_{x \in [a,b]} g(x) \ge a$ and $\max_{x \in [a,b]} g(x) \le b$, it follows that

$$h(a) = g(a) - a \ge 0 \quad \text{and} \quad h(b) = g(b) - b \le 0.$$

If either $h(a) = 0$ or $h(b) = 0$, then we have found a root of $h$, which, by
construction, is a fixed point of $g$, and we are done. If neither $h(a) = 0$
nor $h(b) = 0$, then $h(b) < 0 < h(a)$. Since $h$ is continuous on $[a, b]$, the
Intermediate Value Theorem may be invoked to guarantee the existence of
$p \in [a, b]$ such that $h(p) = 0$, which implies $g(p) = p$.

(2) Uniqueness

This part of the proof will proceed by contradiction. Suppose that $p$ and $q$
are both fixed points of the function $g$ on the interval $[a, b]$, with $p \ne q$. By
the definition of a fixed point, $g(p) = p$ and $g(q) = q$. Then

$$\begin{aligned}
|p - q| &= |g(p) - g(q)| \\
&= |g'(\xi)(p - q)| && \text{by the Mean Value Theorem} \\
&= |g'(\xi)||p - q| \\
&\le k|p - q| < |p - q|,
\end{aligned}$$

which is a contradiction. Hence, $p = q$, and the fixed point is unique.


It should be noted that the hypotheses of this theorem are sufficient conditions.
By themselves, these conditions guarantee the existence and uniqueness of a
fixed point. However, the hypotheses are not necessary conditions, meaning that it
is possible for a function to violate one or more of the hypotheses, yet still have a
(possibly unique) fixed point. For example, consider the function $g(x) = 4x(1 - x)$
on the interval $[0.1, \infty)$. Since $\lim_{x \to \infty} g(x) = -\infty$, $g$ clearly does not map $[0.1, \infty)$
into itself. Furthermore, $\lim_{x \to \infty} |g'(x)| = +\infty$, so that $g$ also violates the hypothesis
regarding the magnitude of the first derivative. However, $g$ has fixed points at
$x = 0$ and $x = 3/4$, so that $g$ does in fact have a unique fixed point on the interval
$[0.1, \infty)$.


Fixed Point Iteration 


If it is known that a function $g$ has a fixed point, one way to approximate the value
of that fixed point is to use what is known as a fixed point iteration scheme. These
can be defined as follows:

Definition. A FIXED POINT ITERATION SCHEME (also known as a FUNCTIONAL
iteration scheme) to approximate the fixed point, $p$, of a function $g$,
generates the sequence $\{p_n\}$ by the rule $p_n = g(p_{n-1})$ for all $n \ge 1$, given a
starting approximation, $p_0$.


Within a fixed point iteration scheme, the function $g$ is often referred to as
the iteration function.


EXAMPLE 2.6 Fixed Point Iteration in Action 


In Figure 2.7, we saw that the function $g(x) = e^{-x}$ has a unique fixed point
somewhere near $x = 0.6$. To locate this fixed point more precisely, we will now perform
fixed point iteration with $g$ as the iteration function and $p_0 = 0$. The first ten
iterations yield

    $p_0 = 0$
    $p_1 = g(p_0) = 1.0000000000$
    $p_2 = g(p_1) = 0.3678794412$
    $p_3 = g(p_2) = 0.6922006276$
    $p_4 = g(p_3) = 0.5004735006$
    $p_5 = g(p_4) = 0.6062435351$
    $p_6 = g(p_5) = 0.5453957860$
    $p_7 = g(p_6) = 0.5796123355$
    $p_8 = g(p_7) = 0.5601154614$
    $p_9 = g(p_8) = 0.5711431151$
    $p_{10} = g(p_9) = 0.5648793474$.


The sequence appears to be converging, albeit very slowly. In fact, it takes more
than 20 iterations for $p_n$ to agree with the exact fixed point to at least 5 significant
decimal digits.
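The iteration in this example is simple enough to reproduce directly. The Python sketch below generates the same sequence; the helper name `fixed_point_sequence` is an illustrative choice, not notation from the text.

```python
import math


def fixed_point_sequence(g, p0, n_iter):
    """Return [p_0, p_1, ..., p_n_iter] generated by p_n = g(p_(n-1))."""
    sequence = [p0]
    for _ in range(n_iter):
        sequence.append(g(sequence[-1]))
    return sequence


def g(x):
    return math.exp(-x)


seq = fixed_point_sequence(g, 0.0, 10)
for n, p in enumerate(seq):
    print(f"p_{n} = {p:.10f}")
```

The printed values oscillate about the fixed point, alternately overshooting and undershooting it, which is consistent with the slow convergence noted above.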


Although the study of fixed points is an important subject in its own right, the
objective of this chapter is still the rootfinding problem, so some connection between
fixed point problems and rootfinding problems has to be established. Fortunately,
or perhaps unfortunately, every rootfinding problem can be transformed
into any number of different fixed point problems. Some of these fixed point
problems will converge rapidly, some will converge slowly and some will not converge at
all. The conversion process is actually quite simple. Take the rootfinding equation,
$f(x) = 0$, and algebraically transform it into an equation of the form $x = \ldots$;
the expression on the right-hand side of the resulting equation is a corresponding
iteration function $g(x)$.

To demonstrate the process, consider the function $f(x) = x^3 + x^2 - 3x - 3$,
which has a unique zero on the interval $(1, 2)$. The objective here will be to
approximate this root using fixed point iteration. Starting from the equation
$x^3 + x^2 - 3x - 3 = 0$, one possible iteration function arises by transposing the linear
term to the right-hand side and dividing by 3. Let $g_1(x)$ denote the resulting
iteration function, $g_1(x) = (x^3 + x^2 - 3)/3$. Alternatively, both the linear and
constant terms could be transposed to the right-hand side. Dividing the resulting
equation by $x^2$ and subtracting 1 produces the iteration function
$g_2(x) = -1 + (3x + 3)/x^2$. Three additional iteration functions that will be examined are
$g_3(x) = \sqrt[3]{3 + 3x - x^2}$, $g_4(x) = \sqrt{(3 + 3x - x^2)/x}$, and
$g_5(x) = x - (x^3 + x^2 - 3x - 3)/(3x^2 + 2x - 3)$. The function $g_3(x)$ is derived by isolating
the cubic term and taking the cube root, while $g_4(x)$ is derived by isolating the cubic
term, dividing by $x$, and then taking the square root. An explanation for $g_5(x)$ will be
deferred to the next section. The results of applying each of these five functions, with a
starting approximation of $p_0 = 1$ and a stopping criterion of $|p_n - p_{n-1}| < 10^{-7}$,
are shown in Table 2.4.

The results in Table 2.4 display a wide range of convergence behaviors. The
sequence generated by the first iteration function converges, but to a fixed point
outside the interval $(1, 2)$. The sequence generated by $g_2(x)$ fails to converge
despite attaining values quite close to the fixed point determined by $g_1(x)$ (see the
values of $p_9$ and $p_{10}$). Convergence of the sequences for the third and fifth iteration
functions is rapid, but is achieved in very different manners. Each iteration with
$g_3(x)$ produces roughly one additional decimal place of accuracy, whereas $g_5(x)$
roughly doubles the number of correct decimal places with each iteration. Had
greater precision been requested, the fifth sequence would have converged much
faster than the third. Finally, the fourth sequence converges to the desired fixed
point, but does so very slowly.
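The experiment can be repeated with the Python sketch below, which applies each of the five iteration functions with $p_0 = 1$ and the stopping criterion $|p_n - p_{n-1}| < 10^{-7}$. The helper `iterate`, its return convention, and the cap of 31 iterations are assumptions made for this illustration.

```python
import math


def iterate(g, p0, tol=1e-7, max_iter=31):
    """Iterate p_n = g(p_(n-1)) until |p_n - p_(n-1)| < tol.

    Returns (last iterate, iterations used, converged flag).
    """
    p_old = p0
    for n in range(1, max_iter + 1):
        p_new = g(p_old)
        if abs(p_new - p_old) < tol:
            return p_new, n, True
        p_old = p_new
    return p_old, max_iter, False


schemes = {
    "g1": lambda x: (x**3 + x**2 - 3) / 3,
    "g2": lambda x: -1 + (3 * x + 3) / x**2,
    "g3": lambda x: (3 + 3 * x - x**2) ** (1 / 3),
    "g4": lambda x: math.sqrt((3 + 3 * x - x**2) / x),
    "g5": lambda x: x - (x**3 + x**2 - 3 * x - 3) / (3 * x**2 + 2 * x - 3),
}

results = {name: iterate(g, 1.0) for name, g in schemes.items()}
for name, (p, n, converged) in results.items():
    status = f"converged after {n} iterations" if converged else "did not converge"
    print(f"{name}: {status}, last iterate {p:.9f}")
```

All five functions share the fixed point $\sqrt{3}$, yet only three of the sequences reach it from $p_0 = 1$, which is the point of the comparison: the choice of iteration function, not just the underlying equation, determines the behavior.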


Convergence 


Based on the results of this experiment, the following basic problem must be tackled 
whenever fixed point iteration is to be used to solve a rootfinding problem: 


Given a function $f$ whose roots are to be determined, can an iteration
function, $g$, be constructed such that the fixed points of $g$ are
the roots of $f$ AND for some starting approximation $p_0$, the sequence
$p_n = g(p_{n-1})$ converges to $p$, a root of $f$?


To have any hope of answering this question in the affirmative, conditions on g 
that will guarantee convergence of the iteration scheme must first be known. The 
following theorem provides the needed information. 


Theorem. Let $g$ be continuous on the closed interval $[a, b]$ with $g : [a, b] \to
[a, b]$. Furthermore, suppose that $g$ is differentiable on the open interval $(a, b)$
and there exists a positive constant $k < 1$ such that $|g'(x)| \le k < 1$ for all
$x \in (a, b)$. Then

(1) the sequence $\{p_n\}$ generated by $p_n = g(p_{n-1})$ converges to the fixed point
$p$ for any $p_0 \in [a, b]$;

(2) $|p_n - p| \le k^n \max(p_0 - a, b - p_0)$; and,

(3) $|p_n - p| \le \dfrac{k^n}{1 - k}|p_1 - p_0|$.


Note that the hypotheses of this theorem are precisely those that were sufficient
to guarantee that a function has a unique fixed point, so the reference in
conclusion (1) to THE fixed point of $g$ is justified. As with the previous theorem
in this section, the hypotheses of this theorem are sufficient conditions for
convergence of the iteration scheme, though not necessary. Direct calculation shows that
for the fifth iteration function examined above, $\max_{x \in [1,2]} |g_5'(x)| = 8$, so that the
derivative bound is violated, yet the sequence generated by $g_5(x)$ converged very
rapidly.


 n   (a) $g_1(x)$     (b) $g_2(x)$         (c) $g_3(x)$    (d) $g_4(x)$    (e) $g_5(x)$
 0   +1.0             +1.0                 +1.0            +1.0            +1.0
 1   -0.3333333333    5.0000000000         1.709975947     2.236067978     3.000000000
 2   -0.9753086420    -0.2800000000        1.733134316     1.451059202     2.200000000
 3   -0.9921709716    26.5510204100        1.731994802     1.901682432     1.830150754
 4   -0.9974310264    -0.8827544117        1.732053695     1.635808067     1.737795453
 5   -0.9991480696    -0.5486245114        1.732050659     1.788336635     1.732072292
 6   -0.9997165068    3.4989256110         1.732050815     1.699764653     1.732050808
 7   -0.9999055558    0.1024544340         1.732050807     1.750767137     1.732050808
 8   -0.9999685246    314.0796731                          1.721269132
 9   -0.9999895088    -0.9904178717                        1.738283891
10   -0.9999965030    -0.9706946913                        1.728454854
11   -0.9999988343    -0.9066955739                        1.734127847
12   -0.9999996114    -0.6595130205                        1.730851932
13   -0.9999998704    1.8484159170                         1.732743080
14   -0.9999999568    2.8747932030                         1.731651158
15                    0.4065545020                         1.732281557
16                    24.52938017                          1.731917588
17                    -0.8727117320                        1.732127723
18                    -0.4986188487                        1.732006401
19                    5.0499512450                         1.732076446
20                    -0.2882970612                        1.732036005
25                    Does not converge                    1.732051758
30                                                         1.732050747
31                                                         1.732050842

$g_1(x) = \dfrac{x^3 + x^2 - 3}{3}$, $g_2(x) = -1 + \dfrac{3x + 3}{x^2}$, $g_3(x) = \sqrt[3]{3 + 3x - x^2}$,

$g_4(x) = \sqrt{\dfrac{3 + 3x - x^2}{x}}$, $g_5(x) = x - \dfrac{x^3 + x^2 - 3x - 3}{3x^2 + 2x - 3}$

TABLE 2.4: Comparison of Fixed Point Iteration Schemes


Proof. (1) To establish the first part of the theorem, it must be shown that
$|p_n - p| \to 0$ as $n \to \infty$ for any starting value $p_0 \in [a, b]$. Therefore, let
$p_0 \in [a, b]$. Since $g : [a, b] \to [a, b]$, we are guaranteed that $p_n = g(p_{n-1})$ is
well-defined and that $p_n \in [a, b]$ for all $n$. Furthermore,

$$\begin{aligned}
|p_n - p| &= |g(p_{n-1}) - g(p)| && \text{definition of } p_n \text{ and } p \\
&= |g'(\xi)||p_{n-1} - p| && \text{Mean Value Theorem} \\
&\le k|p_{n-1} - p| && \text{hypothesis on } g' \\
&\le k^2|p_{n-2} - p| && \text{repeat previous 3 steps} \\
&\;\;\vdots \\
&\le k^n|p_0 - p|.
\end{aligned}$$

Now, since $k < 1$,

$$\lim_{n \to \infty}|p_n - p| \le \lim_{n \to \infty} k^n|p_0 - p| = |p_0 - p|\lim_{n \to \infty} k^n = 0.$$

(2) Combining the bound

$$|p_n - p| \le k^n|p_0 - p|$$

obtained in the proof of part (1) with the bound

$$|p_0 - p| \le \max(p_0 - a, b - p_0)$$

establishes the second conclusion of the theorem.

(3) Proceeding in exactly the same manner as in part (1), it can be shown
that

$$|p_{n+1} - p_n| \le k^n|p_1 - p_0|.$$

Now, let $m > n$. Then

$$\begin{aligned}
|p_m - p_n| &= |p_m - p_{m-1} + p_{m-1} - p_{m-2} + \cdots + p_{n+1} - p_n| \\
&\le |p_m - p_{m-1}| + |p_{m-1} - p_{m-2}| + \cdots + |p_{n+1} - p_n| \\
&\le k^{m-1}|p_1 - p_0| + k^{m-2}|p_1 - p_0| + \cdots + k^n|p_1 - p_0| \\
&= k^n|p_1 - p_0|(1 + k + k^2 + \cdots + k^{m-n-1}).
\end{aligned}$$

In part (1), it was established that $p_m \to p$ as $m \to \infty$, so

$$|p - p_n| \le k^n|p_1 - p_0|\sum_{i=0}^{\infty} k^i = \frac{k^n}{1 - k}|p_1 - p_0|,$$

where the formula for the sum of a convergent geometric series was used to
obtain the final result. □


The theoretical error bound established in part (3) of this theorem clearly 
demonstrates the importance of the parameter k on the convergence behavior of a 
fixed point iteration scheme. When k is “small,” the error in the approximation to 
the fixed point will be reduced rapidly, but as k — 1, the rate of convergence of the 
approximation sequence should decrease dramatically. Notice that when k = 1/2, 
the rate of convergence should be roughly the same as the bisection method. 
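To see how the two bounds from the theorem behave in practice, consider $g(x) = e^{-x}$ on $[0.3, 1]$. This interval is an assumption made here for illustration: $g$ is decreasing with $g(0.3) \approx 0.741$ and $g(1) \approx 0.368$, so $g$ maps the interval into itself, and $|g'(x)| = e^{-x} \le e^{-0.3} \approx 0.741 = k$. The Python sketch below compares the actual error with the bounds in conclusions (2) and (3).

```python
import math


def g(x):
    return math.exp(-x)


a, b = 0.3, 1.0            # assumed interval with g([a, b]) inside [a, b]
k = math.exp(-a)           # maximum of |g'(x)| = e^(-x) on [a, b]
p0 = 1.0

# Reference fixed point, obtained by iterating far past convergence.
p = p0
for _ in range(200):
    p = g(p)

p1 = g(p0)
rows = []
pn = p0
for n in range(1, 21):
    pn = g(pn)
    actual = abs(pn - p)
    bound2 = k**n * max(p0 - a, b - p0)            # conclusion (2)
    bound3 = k**n / (1 - k) * abs(p1 - p0)         # conclusion (3)
    rows.append((n, actual, bound2, bound3))

for n, actual, bound2, bound3 in rows[::5]:
    print(f"n = {n:2d}: error = {actual:.3e}, (2) = {bound2:.3e}, (3) = {bound3:.3e}")
```

Both bounds hold but are pessimistic here, because $k$ is the worst-case value of $|g'|$ over the whole interval while the actual error contracts at a rate governed by $|g'|$ near the fixed point.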


Order of Convergence for Fixed Point Iteration Schemes 


Analytically determining the order of convergence for an arbitrary iteration scheme
can be very difficult. Fortunately, for fixed point iteration schemes of the form
$p_n = g(p_{n-1})$, the order of convergence can be completely characterized in terms
of the value of the derivatives of the iteration function at the fixed point; that is,
in terms of the values of $g^{(k)}(p)$.


Theorem. Let $g$ be a continuous function on the closed interval $[a, b]$ with
$g : [a, b] \to [a, b]$ and suppose that $g'$ is continuous on the open interval $(a, b)$
with $|g'(x)| \le k < 1$ for all $x \in (a, b)$. If $g'(p) \ne 0$, then for any $p_0 \in [a, b]$,
the sequence $p_n = g(p_{n-1})$ converges only linearly to the fixed point $p$.


Proof. First note that the hypotheses stated in the first sentence of the
theorem are precisely those that guarantee that $g$ has a unique fixed point, $p$,
on the interval $[a, b]$ and, that for any starting value $p_0 \in [a, b]$, the sequence
generated by $p_n = g(p_{n-1})$ will converge to $p$. Therefore, the only thing that
needs to be proven here is that if $g'(p) \ne 0$, then the convergence is only
linear. In other words, it must be shown that

$$\lim_{n \to \infty} \frac{|p_{n+1} - p|}{|p_n - p|} = \lambda \in (0, 1)$$

for some $\lambda$.

So consider $|p_{n+1} - p|$. Using the definition of the sequence $\{p_n\}$, the
definition of a fixed point and the Mean Value Theorem, it can be shown that

$$\begin{aligned}
|p_{n+1} - p| &= |g(p_n) - g(p)| \\
&= |g'(c_n)||p_n - p|,
\end{aligned}$$

where $c_n$ is between $p_n$ and $p$. Since $p_n \to p$, it follows by the squeeze theorem
that $c_n \to p$. Furthermore, because $g'$ is continuous on $(a, b)$,

$$\lim_{n \to \infty} |g'(c_n)| = \left|g'\left(\lim_{n \to \infty} c_n\right)\right| = |g'(p)|.$$

Hence,

$$\lim_{n \to \infty} \frac{|p_{n+1} - p|}{|p_n - p|} = |g'(p)|.$$

Finally, because $|g'(p)| \in (0, 1)$ by hypothesis, the order of convergence of
the sequence $\{p_n\}$ is one (linear convergence) with asymptotic error constant
$|g'(p)|$. □
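The limit in this proof is easy to observe numerically. For $g(x) = e^{-x}$ we have $g'(x) = -e^{-x}$, so at the fixed point $|g'(p)| = e^{-p} = p$. The Python sketch below (an illustration; the variable names are chosen here) computes the ratios $|e_{n+1}|/|e_n|$ for the sequence started at $p_0 = 0$.

```python
import math


def g(x):
    return math.exp(-x)


# Reference fixed point, obtained by iterating far past convergence.
p = 0.5
for _ in range(200):
    p = g(p)

seq = [0.0]
for _ in range(20):
    seq.append(g(seq[-1]))

errors = [abs(q - p) for q in seq]
ratios = [errors[n + 1] / errors[n] for n in range(20)]
limit = abs(-math.exp(-p))      # |g'(p)|

print("last few ratios:", [round(r, 6) for r in ratios[-3:]])
print("|g'(p)|        :", round(limit, 6))
```

The ratios should approach $|g'(p)| \approx 0.567$, which is why each iteration removes only a little less than half of the remaining error.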


To obtain higher-order convergence, it is clear that the iteration function must 
have a zero derivative at the fixed point. The next theorem indicates that the more 
derivatives of the iteration function which are zero at the fixed point, the higher 
will be the order of convergence of the generated sequence. 


Theorem. Let $g$ be a continuous function on the closed interval $[a, b]$ with
$\alpha > 1$ continuous derivatives on the open interval $(a, b)$. Further, let $p \in (a, b)$
be a fixed point of $g$. If

$$g'(p) = g''(p) = \cdots = g^{(\alpha - 1)}(p) = 0,$$

but $g^{(\alpha)}(p) \ne 0$, then there exists a $\delta > 0$ such that for any $p_0 \in [p - \delta, p + \delta]$,
the sequence $p_n = g(p_{n-1})$ converges to the fixed point $p$ of order $\alpha$ with
asymptotic error constant

$$\lim_{n \to \infty} \frac{|e_{n+1}|}{|e_n|^{\alpha}} = \frac{|g^{(\alpha)}(p)|}{\alpha!}.$$

Proof. Let’s start by establishing the existence of a d > 0 such that for any 
Do € lp — 6,p + 4], the sequence pa = g(pn—1) converges to the fixed point p. 
We'll tackle the question of order of convergence later. 

Let k < 1. Since g’(p) = 0 and g’ is continuous, it follows that there exists 
a 6 > Osuch that |g'(x)| < & < 1 for alle € J = [p-— 6,p + 4]. From this it 
follows that g: J — I; for if « € I then, 


lg(z) — pl = |9(z) — g(P)| 
= |9' (lle — pl 
<klz-pl <|e-pl <b 


Therefore, by the general fixed point iteration theorem established earlier, the 
sequence Pn = g(Pn—1) converges to the fixed point p for any po € [p—45, p+ d]. 


To establish the order of convergence, let $x \in I$ and expand the iteration
function $g$ into a Taylor series about $x = p$:

$$g(x) = g(p) + g'(p)(x - p) + \cdots + \frac{g^{(\alpha-1)}(p)}{(\alpha - 1)!}(x - p)^{\alpha-1} + \frac{g^{(\alpha)}(\xi)}{\alpha!}(x - p)^{\alpha},$$

where $\xi$ is between $x$ and $p$. Using the hypotheses regarding the value of
$g^{(k)}(p)$ for $1 \le k \le \alpha - 1$ and letting $x = p_n$, the Taylor series expansion
simplifies to

$$p_{n+1} - p = \frac{g^{(\alpha)}(\xi)}{\alpha!}(p_n - p)^{\alpha},$$

where $\xi$ is now between $p_n$ and $p$. The definitions of the fixed point iteration
scheme and of a fixed point have been used to replace $g(p_n)$ with $p_{n+1}$ and
$g(p)$ with $p$. Finally, let $n \to \infty$. Then $p_n \to p$, forcing $\xi \to p$ also. Hence

$$\lim_{n\to\infty} \frac{|e_{n+1}|}{|e_n|^{\alpha}} = \frac{|g^{(\alpha)}(p)|}{\alpha!},$$

or $p_n \to p$ of order $\alpha$. □


Stopping Condition 


For a fixed point iteration scheme that produces a linearly convergent sequence
(i.e., $g'(p) \ne 0$), a stopping condition can be formulated in much the same manner
as one was formulated for the method of false position. In particular, an estimate
for $|e_n|$ can be constructed from terms in the sequence $\{p_n\}$. In the case of fixed
point iteration, the relevant formulas are

$$|e_n| \approx \left| \frac{g'(p)}{g'(p) - 1} \right| |p_n - p_{n-1}| \qquad (1)$$

and

$$g'(p) \approx \frac{p_n - p_{n-1}}{p_{n-1} - p_{n-2}}. \qquad (2)$$

The details are left as an exercise. Thus an appropriate stopping condition would
involve estimating $g'(p)$ and $|e_n|$ using the formulas given above and terminating
the iteration when the estimate of $|e_n|$ falls below the convergence tolerance $\epsilon$.


EXAMPLE 2.7 Error Estimate and Stopping Condition


We know that the function $g(x) = e^{-x}$ has a unique fixed point somewhere near
$x = 0.6$. To ten decimal places, the fixed point happens to be $p = 0.5671432904$.
The absolute error in the first ten approximations obtained from fixed point iteration
with $p_0 = 0$ is listed below, along with the error estimate as obtained from
equations (1) and (2).


 n    p_n             |e_n|           Error Estimate

 1    1.0000000000    0.4328567096
 2    0.3678794412    0.1992638492
 3    0.6922006276    0.1250573371    0.1099745306
 4    0.5004735006    0.0666697898    0.0712322670
 5    0.6062435351    0.0391002447    0.0376047292
 6    0.5453957860    0.0217475044    0.0222212089
 7    0.5796123355    0.0124690451    0.0123155830
 8    0.5601154614    0.0070278290    0.0070769665
 9    0.5711431151    0.0039998247    0.0039839812
10    0.5648793474    0.0022639430    0.0022690318

The first few error estimates differ from the actual errors by as much as about 12%.
After $p_7$, however, the error estimates differ from the actual errors by less than 1%.

If iterations are continued until the error estimate falls below $\epsilon = 5 \times 10^{-6}$, the
final approximation to $p$ is $p_{21} = 0.5671477143$. The estimate of the error in this
approximation is $4.42383 \times 10^{-6}$, which is in excellent agreement with the actual
error of $4.42385 \times 10^{-6}$.
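
The stopping condition built from equations (1) and (2) can be sketched in a few lines of Python. This is only an illustrative sketch, not the text's own algorithm: the function name `fixed_point_linear`, the `max_iter` safeguard and the handling of coincident iterates are assumptions made for this example.

```python
import math

def fixed_point_linear(g, p0, eps, max_iter=100):
    """Fixed point iteration, stopped using the linear-convergence error estimate.

    Returns (p_n, n, estimate of |e_n|). The name and the max_iter safeguard
    are illustrative assumptions, not part of the text.
    """
    p_prev2 = p0           # p_{n-2}
    p_prev1 = g(p0)        # p_{n-1}
    for n in range(2, max_iter + 1):
        p = g(p_prev1)     # p_n
        denom = p_prev1 - p_prev2
        if denom == 0.0:
            # Two consecutive iterates coincide: p_{n-1} is already a fixed point.
            return p, n, 0.0
        g_prime = (p - p_prev1) / denom                     # equation (2)
        if g_prime != 1.0:
            estimate = abs(g_prime / (g_prime - 1.0)) * abs(p - p_prev1)  # equation (1)
            if estimate < eps:
                return p, n, estimate
        p_prev2, p_prev1 = p_prev1, p
    raise RuntimeError("no convergence within max_iter iterations")

# The setting of Example 2.7: g(x) = exp(-x), p0 = 0, tolerance 5e-6.
p, n, estimate = fixed_point_linear(lambda x: math.exp(-x), 0.0, 5e-6)
print(n, p, estimate)
```

Only the two most recent differences are needed, so the sketch stores three iterates rather than the whole sequence.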


When the sequence produced by fixed point iteration has order of convergence
$\alpha > 1$, a simpler stopping condition can be used. Recall that any order of
convergence $\alpha > 1$ corresponds to superlinear convergence and that all superlinearly
convergent sequences satisfy the limit

$$\lim_{n\to\infty} \frac{|p_n - p_{n-1}|}{|p_{n-1} - p|} = 1$$

(see Exercises 14 and 15 in Section 1.2). This limit implies that

$$|p_{n-1} - p| \approx |p_n - p_{n-1}|$$

for any superlinearly convergent sequence. Since $p_n$ is supposed to be a better
approximation to $p$ than $p_{n-1}$, it follows that $|p_n - p_{n-1}|$ should be a conservative
estimate of the error $|e_n| = |p_n - p|$. Consequently, whenever the order of convergence
is greater than one, an appropriate stopping condition would be to terminate
the iteration as soon as $|p_n - p_{n-1}|$ falls below the convergence tolerance $\epsilon$.
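
As a rough sketch of this simpler test, the loop below stops as soon as two successive iterates agree to within the tolerance. The function name `fixed_point_superlinear`, the `max_iter` limit and the choice of iteration function $g(x) = 2x(1 - x)$ (for which $g'(1/2) = 0$) are assumptions made purely for illustration.

```python
def fixed_point_superlinear(g, p0, eps, max_iter=50):
    """Fixed point iteration stopped when |p_n - p_{n-1}| < eps.

    Appropriate only when convergence is superlinear; the name and the
    max_iter safeguard are illustrative assumptions.
    """
    p_prev = p0
    for n in range(1, max_iter + 1):
        p = g(p_prev)
        if abs(p - p_prev) < eps:
            return p, n
        p_prev = p
    raise RuntimeError("no convergence within max_iter iterations")

# g(x) = 2x(1 - x) has the fixed point 1/2, where g'(1/2) = 0.
p, n = fixed_point_superlinear(lambda x: 2.0 * x * (1.0 - x), 0.1, 1e-10)
print(n, p)
```

Because $|p_n - p_{n-1}|$ approximates the error in $p_{n-1}$, the returned value $p_n$ is typically far more accurate than the tolerance requires.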


EXERCISES 


1. Suppose the sequence $\{p_n\}$ is generated by the fixed point iteration scheme
$p_n = g(p_{n-1})$. Further, suppose that the sequence converges linearly to the
fixed point $p$.

(a) Show that
$$g'(p) \approx \frac{p_n - p_{n-1}}{p_{n-1} - p_{n-2}}.$$

(b) Show that
$$|e_n| \approx \left| \frac{g'(p)}{g'(p) - 1} \right| |p_n - p_{n-1}|.$$


2. Construct an algorithm for fixed point iteration when the order of convergence 
is linear. 

3. Construct an algorithm for fixed point iteration when the order of convergence 
is superlinear. 


4. In the literature, it is not uncommon to find fixed point iteration terminated
when $|p_n - p_{n-1}| < \epsilon$, even when convergence is only linear. Comment on the
accuracy of this stopping condition when convergence is linear. Consider the
cases $g'(p) \approx 0$, $g'(p) \approx 1/2$ and $g'(p) \approx 1$.

5. Consider the function $g(x) = \cos x$.

(a) Graphically verify that this function has a unique fixed point on the real
line.

(b) Can we prove that the fixed point is unique using the theorems of this
section? Why or why not?

(c) What order of convergence do we expect from the fixed point iteration
scheme $p_n = g(p_{n-1}) = \cos(p_{n-1})$? Why?

(d) Perform seven iterations starting from $p_0 = 0$. Verify that the appropriate
error estimate is valid. To ten decimal places, the fixed point is
$x \approx 0.7390851332$.


. Consider the function g(x) =1+2—- }2°. 


(a) Analytically verify that this function has a unique fixed point on the real 
line. 

(b) Can we prove that the fixed point is unique using the theorems of this 
section? Why or why not? 

(c) What order of convergence do we expect from the fixed point iteration 
scheme pn = g(pn—1)? Why? 

(d) Perform seven iterations starting from pp = 0. Verify that the appropriate 
error estimate is valid. 


7. Consider the function $g(x) = 2x(1 - x)$, which has fixed points at $x = 0$ and at
$x = 1/2$.

(a) Why should we expect that fixed point iteration, starting even with a value
very close to zero, will fail to converge toward $x = 0$?

(b) Why should we expect that fixed point iteration, starting with $p_0 \in (0,1)$,
will converge toward $x = 1/2$? What order of convergence should we expect?

(c) Perform seven iterations starting from an arbitrary $p_0 \in (0,1)$ and numerically
confirm the order of convergence.


8. Verify that $x = \sqrt{a}$ is a fixed point of the function

$$g(x) = \frac{1}{2}\left( x + \frac{a}{x} \right).$$

Use the techniques of this section to determine the order of convergence and the
asymptotic error constant of the sequence $p_n = g(p_{n-1})$ toward $x = \sqrt{a}$.


9. Verify that $x = \sqrt{a}$ is a fixed point of the function


(x) = % #800 
DE Ba eg. 


Use the techniques of this section to determine the order of convergence and
the asymptotic error constant of the sequence $p_n = g(p_{n-1})$ toward $x = \sqrt{a}$.


10. Verify that $x = 1/a$ is a fixed point of the function $g(x) = x(2 - ax)$. Use
the techniques of this section to determine the order of convergence and the
asymptotic error constant of the sequence $p_n = g(p_{n-1})$ toward $x = 1/a$.

11. Consider the function $g(x) = e^{-x^2}$.
(a) Prove that $g$ has a unique fixed point on the interval $[0, 1]$.
(b) With a starting approximation of $p_0 = 0$, use the iteration scheme
$p_n = e^{-p_{n-1}^2}$ to approximate the fixed point on $[0,1]$ to within $5 \times 10^{-7}$.
(c) Use the theoretical error bound $|p_n - p| \le \frac{k^n}{1-k}|p_1 - p_0|$ to obtain a theoretical
bound on the number of iterations needed to approximate the fixed point to
within $5 \times 10^{-7}$. How does the number of iterations performed in part (b)
compare with the theoretical bound?
12. Repeat Exercise 11 for the function g(x) = 5 Cos Qu. 


13. Repeat Exercise 11 for the function g(x) = 3(2—e* + 2°), 


14. The function $f(x) = e^x + x^2 - x - 4$ has a unique zero on the interval $(1, 2)$. Create
three different iteration functions corresponding to this function, and compare
their convergence properties for approximating the zero on $(1,2)$. Use the same
starting approximation, $p_0$, for each iteration function.

15. Repeat Exercise 14 for the function $f(x) = x^3 - x^2 - 10x + 7$ on the interval
$(0,1)$.

16. Repeat Exercise 14 for the function $f(x) = 1.05 - 1.04x + \ln x$ on the interval
$(1, 2)$.


2.4 NEWTON’S METHOD 


The fundamental concepts of fixed point iteration schemes were developed in the 
previous section, and the connection between the fixed point problem and the 
rootfinding problem was made. In this section, we will present Newton’s method, 
which is perhaps the most well known fixed point iteration scheme for approxi- 
mating the roots of an arbitrary function. Many students are first introduced to 
this technique when discussing applications of the derivative in a first course on 
calculus. 


Newton’s Method 


The basic idea behind Newton's method is quite straightforward. Let $p_n$ denote
the most recent approximation to a zero, $p$, of the function $f$. Replace $f$ by its
tangent line approximation based at the location $x = p_n$, and take the $x$-intercept
of the tangent line as the next approximation, $p_{n+1}$, to the root (see Figure 2.8).
Since the tangent line approximation based at $x = p_n$ is given by

$$y - f(p_n) = f'(p_n)(x - p_n),$$

the explicit expression for $p_{n+1}$ is

$$p_{n+1} = p_n - \frac{f(p_n)}{f'(p_n)}.$$

This last equation provides the definition for the iteration function of Newton's
method.

Figure 2.8 Newton's method for approximating the zero of a function:
schematic for a single iteration.


Definition. Newton's Method is the fixed point iteration scheme based on
the iteration function

$$g(x) = x - \frac{f(x)}{f'(x)};$$

that is, starting from an initial approximation, $p_0$, the sequence $\{p_n\}$ is generated
via $p_n = g(p_{n-1})$.


Note that each iteration of Newton’s method requires two separate function 
evaluations: one evaluation of the function and one of its derivative. This number 
should be compared to the single function evaluation needed per iteration for both 
the bisection method and the method of false position. 
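
A minimal Python sketch of the definition follows. The function name `newton`, the default tolerance, the `max_iter` limit and the test $|p_n - p_{n-1}| < \epsilon$ used to stop the loop are assumptions made for this illustration; the test problem $f(x) = x^3 + 2x^2 - 3x - 1$ with $p_0 = 1$ is the demonstration problem used throughout the chapter.

```python
def newton(f, fprime, p0, eps=1e-10, max_iter=50):
    """Return the list of Newton iterates [p0, p1, ..., pn].

    Iteration stops when |p_n - p_{n-1}| < eps. The name, default tolerance
    and max_iter safeguard are illustrative assumptions.
    """
    iterates = [p0]
    p = p0
    for _ in range(max_iter):
        slope = fprime(p)              # one evaluation of f'
        if slope == 0.0:
            raise ZeroDivisionError("f'(p_n) = 0: the tangent line has no x-intercept")
        p_next = p - f(p) / slope      # one evaluation of f
        iterates.append(p_next)
        if abs(p_next - p) < eps:
            return iterates
        p = p_next
    raise RuntimeError("no convergence within max_iter iterations")

iterates = newton(lambda x: x**3 + 2*x**2 - 3*x - 1,
                  lambda x: 3*x**2 + 4*x - 3,
                  1.0)
for n, p_n in enumerate(iterates):
    print(n, f"{p_n:.10f}")
```

The full list of iterates is kept here only so the sequence can be inspected; the update itself needs nothing beyond the most recent approximation.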


EXAMPLE 2.8  Newton's Method in Action


Recall the primary demonstration problem from previous sections: Locate the
unique zero of the function $f(x) = x^3 + 2x^2 - 3x - 1$ on the interval $(1,2)$. To
apply Newton's method, the derivative of $f$ is needed. For this problem, $f'(x) =
3x^2 + 4x - 3$. With a starting approximation of $p_0 = 1$, four iterations of Newton's
method produce the results

$$p_1 = p_0 - \frac{f(p_0)}{f'(p_0)} = 1.2500000000;$$
$$p_2 = p_1 - \frac{f(p_1)}{f'(p_1)} = 1.2009345794;$$
$$p_3 = p_2 - \frac{f(p_2)}{f'(p_2)} = 1.1986958411; \text{ and}$$
$$p_4 = p_3 - \frac{f(p_3)}{f'(p_3)} = 1.1986912435.$$

The approximation $p_4$ is correct to the digits shown and has an absolute error of
roughly $1.937 \times 10^{-11}$.


In the preceding example, Newton's method achieved an accuracy of $1.937 \times
10^{-11}$ with only eight function evaluations: four evaluations of $f$ and four evaluations
of $f'$. For comparison, starting from the interval $(1,2)$, the bisection method
needs 36 function evaluations and the method of false position needs 31 evaluations
to produce similar accuracy. The information summarized in Table 2.5 indicates
that Newton's method also outperforms both the bisection method and
false position on the other two standard test problems; that is, locating the zero of
$f(x) = \tan(\pi x) - x - 6$ on $(0.4, 0.48)$ and locating the zero of $f(x) = x^3 + 2x^2 - 3x - 1$
on $(-3, -2)$. Note that false position performs exceptionally well on the latter problem,
yet still uses 50% more function evaluations than Newton's method.


                    tan(πx) − x − 6                   x^3 + 2x^2 − 3x − 1
                    p = 0.4510472588                  p = −2.9122291785

Newton's Method     p_0 = 0.48                        p_0 = −3
                    p_5 = 0.4510472613                p_3 = −2.9122291786
                    |p_5 − p| ≈ 2.448 × 10^-9         |p_3 − p| ≈ 9.346 × 10^-11
                    10 function evaluations           6 function evaluations

Bisection Method    (a_1, b_1) = (0.4, 0.48)          (a_1, b_1) = (−3, −2)
                    p_25 = 0.4510472608               p_34 = −2.9122291785
                    |p_25 − p| ≈ 1.931 × 10^-9        |p_34 − p| ≈ 4.716 × 10^-11
                    25 function evaluations           34 function evaluations

False Position      (a_1, b_1) = (0.4, 0.48)          (a_1, b_1) = (−3, −2)
                    p_32 = 0.4510472563               p_8 = −2.9122291784
                    |p_32 − p| ≈ 2.551 × 10^-9        |p_8 − p| ≈ 9.100 × 10^-11
                    33 function evaluations           9 function evaluations


TABLE 2.5: Comparison of Newton's Method, the Bisection Method and the Method of False Position


Before delving into an analysis of the convergence of Newton's method, let's
investigate the influence of the initial approximation, $p_0$. Consider the function
$f(x) = x^3 + 2x^2 - 3x - 1$. Table 2.6 displays the sequences generated by Newton's
method for $p_0 = 1$, $p_0 = 2$, and $p_0 = 3$. Here, changing the value of $p_0$ results only
in a variation in the number of iterations needed to achieve convergence. A similar
observation can be made when Newton's method is initialized with $p_0 = -3$, $p_0 =
-2.5$, and $p_0 = -2$. All three sequences converge to $-2.9122291785$, just in a
different number of iterations.

 n    p_0 = 1         p_0 = 2         p_0 = 3
 1    1.2500000000    1.4705882353    2.0277777778
 2    1.2009345794    1.2471326788    1.4845011523
 3    1.1986958411    1.2006987324    1.2514517238
 4    1.1986912435    1.1986949265    1.2010586170
 5                    1.1986912435    1.1986963626
 6                                    1.1986912435

TABLE 2.6: Influence of $p_0$ on Newton's Method Sequence

Much more substantial changes in performance are noted when working with
the function $f(x) = \tan(\pi x) - x - 6$. Recall that when $p_0 = 0.48$, Newton's method
converges to 0.4510472613 in five iterations. With a starting approximation of $p_0 =
0.4$, however, the sequence generated by Newton's method fails to converge, even
after 5000 iterations. Thus, unlike simple enclosure methods, Newton's method is
not guaranteed to converge for an arbitrary starting approximation. With $p_0 =
0$, we observe another interesting phenomenon. The sequence converges after 42
iterations to 697.4995475. This is, in fact, one of the many zeroes of $f(x) =
\tan(\pi x) - x - 6$. Hence, even when Newton's method converges, it may converge
to a value very far from $p_0$.

Clearly, these examples demonstrate that the convergence of the Newton's
method sequence is heavily dependent upon the choice of $p_0$. With a "good"
choice of starting approximation, however, the sequence will converge very rapidly.
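
The mild dependence on $p_0$ seen for the polynomial test problem can be reproduced with a short experiment. The sketch below is illustrative only: the helper name `newton_sequence`, the tolerance and the iteration cap are assumptions, and the experiment is restricted to $f(x) = x^3 + 2x^2 - 3x - 1$, since results for $\tan(\pi x) - x - 6$ are sensitive to floating-point details.

```python
def f(x):
    return x**3 + 2*x**2 - 3*x - 1

def fprime(x):
    return 3*x**2 + 4*x - 3

def newton_sequence(p0, eps=1e-10, max_iter=50):
    """Newton iterates from p0, stopped when |p_n - p_{n-1}| < eps.

    Helper name, tolerance and iteration cap are illustrative assumptions.
    """
    seq = [p0]
    while len(seq) <= max_iter:
        p = seq[-1]
        seq.append(p - f(p) / fprime(p))
        if abs(seq[-1] - seq[-2]) < eps:
            return seq
    raise RuntimeError("no convergence within max_iter iterations")

# Same function, three different starting approximations.
sequences = {p0: newton_sequence(p0) for p0 in (1.0, 2.0, 3.0)}
for p0, seq in sequences.items():
    print(f"p0 = {p0}: {len(seq) - 1} iterations, limit {seq[-1]:.10f}")
```

All three runs settle on the same zero; only the iteration count changes with the starting value.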


Convergence Analysis for Newton’s Method 


Let's take a more formal look into the convergence properties of Newton's method,
in an attempt to quantify the dependence on the choice of $p_0$. The simplest plan of
attack is to apply the general fixed point iteration convergence theorem which was
proven in Section 2.3. To do this, it must be shown that there exists an interval,
$I$, which contains the root, $p$, for which

1. $g$ is continuous on the interval $I$;
2. $g$ maps $I$ into $I$; and
3. $|g'(x)| \le k < 1$ for all $x \in I$,

where $g$ is the Newton's method iteration function. If all three of these conditions
can be established, then by the theorem of Section 2.3, it can be concluded that
Newton's method will converge for any starting approximation $p_0 \in I$.


Theorem. Let $f$ be a twice continuously differentiable function on the interval
$[a,b]$ with $p \in (a,b)$ and $f(p) = 0$. Further suppose that $f'(p) \ne 0$. Then
there exists a $\delta > 0$ such that for any $p_0 \in I = [p - \delta, p + \delta]$, the sequence
$\{p_n\}$ generated by Newton's method converges to $p$.


Proof. First recall that the Newton iteration function is given by

$$g(x) = x - \frac{f(x)}{f'(x)}.$$

The condition $f(p) = 0$ then implies that $g(p) = p$. The proof of the theorem
will now proceed in three steps.


Step 1. Show that $g$ is continuous "near" $p$.
Given the definition of the iteration function $g$ and the continuity assumptions
on the function $f$, the only possible discontinuity in $g$ would be from
division by zero with $f'$. However, the continuity of $f'$ and the assumption
that $f'(p) \ne 0$ imply the existence of a positive constant $\delta_1$ such that $f'(x) \ne 0$
for all $x \in I_1 = [p - \delta_1, p + \delta_1] \subset [a,b]$. The reasoning behind this conclusion
is that, although $f'$ could have a zero somewhere in the vicinity of $p$, continuity
requires that the distance between $p$ and that zero be of finite, not
infinitesimal, size. Therefore, $g$ is well defined, and hence continuous, on $I_1$.


Step 2. Show that $|g'(x)|$ is "small" near $p$.
A straightforward calculation shows that

$$g'(x) = \frac{f(x) f''(x)}{[f'(x)]^2}.$$

Having already established that $f'(x) \ne 0$ for all $x \in I_1$, it follows that $g'$
is continuous on $I_1$. Furthermore, $g'(p) = 0$, which follows from $f(p) = 0$.
Choose any $k$ that satisfies $0 < k < 1$. By an argument similar to the one
applied in Step 1, it follows that there exists a positive constant $\delta$, with $\delta \le \delta_1$,
such that

$$|g'(x)| \le k < 1$$

for all $x \in I = [p - \delta, p + \delta]$.


Step 3. Show that $g$ maps the interval $I$ into itself.
Let $x \in I$. Then

$$|g(x) - p| = |g(x) - g(p)|$$
$$= |g'(\xi)|\,|x - p| \quad \text{for some } \xi \text{ between } x \text{ and } p$$
$$\le k|x - p| < |x - p| \le \delta,$$

since $x \in [p - \delta, p + \delta]$. In the second line, the mean value theorem was applied.
Therefore, $g(x) \in [p - \delta, p + \delta] = I$.

Summary: Having established that

1. $g$ is continuous on the interval $I$;
2. $g$ maps $I$ into $I$; and
3. $|g'(x)| \le k < 1$ for all $x \in I$,

the convergence theorem from Section 2.3 guarantees that the sequence generated
by Newton's method will converge to the root $p$ for any starting approximation
$p_0 \in I$. □


Although this theorem guarantees that $\delta$ exists, it may be very small, implying
the need for a very good starting approximation to ensure convergence of the sequence.
For instance, in test problem 2, locating the root of $f(x) = \tan(\pi x) - x - 6$
on the interval [0,0.48], it can be shown that $\delta \approx 0.02$. It is therefore not uncommon,
in practice, to find Newton's method combined with a simple enclosure
method. Several iterations of the simple enclosure method are performed to obtain
the starting approximation for Newton's method. The interval on which the
root has been localized can then be used to test the approximations generated by
Newton's method. If one of those approximations is found to fall outside the localizing
interval, additional iterations of the simple enclosure method are performed.
Newton's method is then restarted with a refined initial estimate of the root. This
procedure is repeated as necessary until convergence is obtained.
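
One common way to realize such a combination is sketched below. It is a variant of the procedure just described rather than a literal transcription: instead of running a batch of enclosure iterations and then restarting Newton's method, it keeps a bracketing interval at every step and falls back to a single bisection step whenever the Newton step would leave that interval. The function name `newton_bisection`, the tolerance and the iteration cap are assumptions made for this illustration.

```python
import math

def newton_bisection(f, fprime, a, b, eps=1e-10, max_iter=100):
    """Newton's method safeguarded by a bracketing interval [a, b].

    A Newton step is accepted only if it stays strictly inside the current
    bracket; otherwise a bisection step is taken. Name, tolerance and
    max_iter are illustrative assumptions.
    """
    fa, fb = f(a), f(b)
    if fa == 0.0:
        return a
    if fb == 0.0:
        return b
    if fa * fb > 0.0:
        raise ValueError("f(a) and f(b) must have opposite signs")
    p = 0.5 * (a + b)
    for _ in range(max_iter):
        fp = f(p)
        if fp == 0.0:
            return p
        # Shrink the localizing interval using the sign of f(p).
        if fa * fp < 0.0:
            b, fb = p, fp
        else:
            a, fa = p, fp
        slope = fprime(p)
        candidate = p - fp / slope if slope != 0.0 else None
        if candidate is None or not (a < candidate < b):
            candidate = 0.5 * (a + b)      # Newton step rejected: bisect instead
        if abs(candidate - p) < eps:
            return candidate
        p = candidate
    raise RuntimeError("no convergence within max_iter iterations")

# Test problem 2: f(x) = tan(pi x) - x - 6 on (0.4, 0.48).
root = newton_bisection(lambda x: math.tan(math.pi * x) - x - 6.0,
                        lambda x: math.pi / math.cos(math.pi * x) ** 2 - 1.0,
                        0.4, 0.48)
print(f"{root:.10f}")
```

The bracket guarantees that every accepted approximation stays in a region known to contain the root, at the cost of one sign test per iteration.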

During the course of the above proof, we established that, provided $f'(p) \ne 0$,

$$g(p) = p \quad \text{and} \quad g'(p) = 0,$$

where $g$ is the iteration function for Newton's method. Therefore, the order of
convergence for a sequence generated by Newton's method is at least quadratic. To
determine the exact order of convergence, we need to continue calculating derivatives
of $g$ and evaluating them at $x = p$ until we find a derivative which does not
evaluate to zero. For the second derivative, we find

$$g''(x) = \frac{f''(x)}{f'(x)} + \frac{f(x) f'''(x)}{[f'(x)]^2} - \frac{2 f(x) [f''(x)]^2}{[f'(x)]^3},$$

from which it follows that $g''(p) = f''(p)/f'(p)$. Since this will not be zero in general,
Newton's method is of order two, with asymptotic error constant $\lambda = f''(p)/2f'(p)$,
provided $f'(p) \ne 0$. Because the convergence of Newton's method is superlinear,
our work at the end of the previous section suggests that an appropriate stopping
condition for Newton's method is to terminate the iteration when $|p_n - p_{n-1}|$ falls
below a specified convergence tolerance $\epsilon$.

EXAMPLE 2.9 Demonstration of Order of Convergence and Asymptotic
Error Constant


Consider Newton's method applied to the function $f(x) = x^3 + 2x^2 - 3x - 1$ with
a starting approximation of $p_0 = 2$. The absolute error in $p_0$ and the first five
Newton's method approximations is listed in the table below. Also listed in the
table is the ratio $|e_n|/|e_{n-1}|^2$.


 n    Absolute Error, |e_n|    |e_n|/|e_{n-1}|^2
 0    8.0130876 × 10^-1
 1    2.7189699 × 10^-1        0.42345
 2    4.8441435 × 10^-2        0.65525
 3    2.0074889 × 10^-3        0.85549
 4    3.6829405 × 10^-6        0.91387
 5    1.2432499 × 10^-11       0.91657


Note that the ratio $|e_n|/|e_{n-1}|^2$ approaches a constant, thereby providing numerical
confirmation of the quadratic convergence of the sequence. Further, the error ratio
appears to be approaching

$$\frac{f''(p)}{2f'(p)} \approx 0.916586,$$

providing numerical confirmation that the asymptotic error constant for Newton's
method is $\lambda = f''(p)/2f'(p)$.
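
The numerical check carried out in this example can be repeated with the short Python sketch below. It is an illustration under stated assumptions: the exact root is not available in closed form, so the sketch uses the result of eight Newton iterations as a stand-in for $p$, and the variable names are chosen for this example only.

```python
def f(x):
    return x**3 + 2*x**2 - 3*x - 1

def fprime(x):
    return 3*x**2 + 4*x - 3

def fsecond(x):
    return 6*x + 4

# Generate p_0, ..., p_8 starting from p_0 = 2.
iterates = [2.0]
for _ in range(8):
    p = iterates[-1]
    iterates.append(p - f(p) / fprime(p))

root = iterates[-1]   # assumption: p_8 is accurate to machine precision
errors = [abs(p_n - root) for p_n in iterates[:5]]                # |e_0|, ..., |e_4|
ratios = [errors[n] / errors[n - 1] ** 2 for n in range(1, 5)]    # n = 1, ..., 4
limit = fsecond(root) / (2.0 * fprime(root))                      # f''(p) / 2 f'(p)

for n, r in enumerate(ratios, start=1):
    print(n, f"{errors[n]:.7e}", f"{r:.5f}")
print("f''(p)/2f'(p) =", f"{limit:.6f}")
```

The ratio for $n = 5$ is omitted because $|e_5|$ is close to the resolution of double precision arithmetic, which makes that entry sensitive to rounding.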


Newton's Method with Roots of Multiplicity > 1


The final issue that we will discuss at this time regarding Newton's method is the
performance of the method when $f'(p) = 0$. If $f(p) = f'(p) = 0$, then $f$ must have
a zero of multiplicity $m \ge 2$ at $x = p$. This implies that $f$ can be written in the
form

$$f(x) = (x - p)^m q(x),$$

where $\lim_{x \to p} q(x) \ne 0$. Substituting this expression for $f$ into the Newton's method
iteration function

$$g(x) = x - \frac{f(x)}{f'(x)},$$

we find

$$g(x) = x - \frac{(x - p)\,q(x)}{(x - p)\,q'(x) + m\,q(x)}$$

and

$$g'(x) = \frac{\left[ m(m - 1)q(x) + 2m(x - p)q'(x) + (x - p)^2 q''(x) \right] q(x)}{\left[ (x - p)q'(x) + m\,q(x) \right]^2}.$$

Therefore, $g(p) = p$, but $g'(p) = 1 - 1/m$, which is nonzero for any root of multiplicity
greater than one. Accordingly, Newton's method provides only linear convergence
for roots of multiplicity greater than one. Since the asymptotic error constant
is given by $g'(p)$, the rate of convergence in these cases for Newton's method will
be $O((1 - 1/m)^n)$. Note that for a root of multiplicity greater than two, this rate
of convergence is slower than that of the bisection method.

EXAMPLE 2.10 Newton's Method for a Problem with a Root of
Multiplicity > 1


Consider the function $f(x) = x(1 - \cos x)$, which has a root of multiplicity three at
$x = 0$. The following table shows the results of ten iterations of Newton's method
applied to this problem with a starting value of $p_0 = 1$. For comparison, the results
of the bisection method, starting from the interval $[-2, 1]$, are shown in the third
column.


 n    Newton's Method    Bisection Method

 1    0.6467039965       −0.5000000000
 2    0.4259712109        0.2500000000
 3    0.2825304410       −0.1250000000
 4    0.1879335654        0.0625000000
 5    0.1251658102       −0.0312500000
 6    0.0834075192        0.0156250000
 7    0.0555942620       −0.0078125000
 8    0.0370596587        0.0039062500
 9    0.0247054965       −0.0019531250
10    0.0164700517        0.0009765625


For this problem, the bisection method significantly outperforms Newton's method,
especially considering that Newton's method uses two function evaluations per iteration
while the bisection method uses just one.

Furthermore, notice that with Newton's method, the error in each approximation
is roughly two-thirds of the error in the previous approximation. This is
due to the root being of multiplicity three. With a root of multiplicity three, the
analysis given above implies that the rate of convergence for Newton's method will
be $O((2/3)^n)$.
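
The two-thirds error ratio is easy to observe numerically. The following Python sketch is only an illustration (the variable names are chosen for this example); because the root is at $x = 0$, each iterate is also its own error, so successive quotients $p_n / p_{n-1}$ estimate $g'(p) = 1 - 1/m$ directly.

```python
import math

def f(x):
    return x * (1.0 - math.cos(x))

def fprime(x):
    return 1.0 - math.cos(x) + x * math.sin(x)

# Ten Newton iterations from p_0 = 1 toward the triple root at x = 0.
iterates = [1.0]
for _ in range(10):
    p = iterates[-1]
    iterates.append(p - f(p) / fprime(p))

# The root is 0, so e_n = p_n and the error ratio is p_n / p_{n-1}.
ratios = [iterates[n] / iterates[n - 1] for n in range(1, 11)]
for n in range(1, 11):
    print(n, f"{iterates[n]:.10f}", f"{ratios[n - 1]:.5f}")
```

The printed ratios drift toward $2/3$, in line with $g'(p) = 1 - 1/m$ for $m = 3$.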


Application Problem 1: Volume of Chlorine Gas


The van der Waals equation,

$$\left( P + \frac{n^2 a}{V^2} \right)(V - nb) = nRT,$$

which relates the pressure ($P$), volume ($V$) and temperature ($T$) of a gas, was
introduced in the Chapter 2 Overview (see page 54). Here, $n$ represents the number
of moles of gas present and $R$ is the universal gas constant. The term involving
the parameter $a$ corrects the pressure for intermolecular attractive forces, while the
term involving the parameter $b$ is a correction for that portion of the volume of the
gas that is not compressible due to the intrinsic volume of the gas molecules.

Suppose that one mole of chlorine gas has a pressure of 2 atmospheres and a
temperature of 313 K. For chlorine gas, $a = 6.29$ atm · liter²/mole² and $b = 0.0562$
liter/mole. What is the volume of the gas?

In the units of this problem, the universal gas constant has the value $R =
0.08206$ atm · liter/mole · K. We will solve for the volume using Newton's method.
The van der Waals equation is first rewritten as the function

$$f(V) = \left( P + \frac{n^2 a}{V^2} \right)(V - nb) - nRT.$$

The derivative of this function is

$$f'(V) = P + \frac{n^2 a}{V^2} - \frac{2 n^2 a}{V^3}(V - nb).$$

A convergence tolerance of $5 \times 10^{-7}$ is used, and a maximum of 10 iterations are
allowed. The initial approximation for the volume is taken from the ideal gas law:

$$V = \frac{nRT}{P} = \frac{(1 \text{ mole})(0.08206 \text{ atm · liter/mole · K})(313 \text{ K})}{2 \text{ atm}} \approx 12.84239 \text{ liters}.$$

The actual volume is found to be $V = 12.6510993$ liter, which is roughly 1.5%
below the ideal gas law value. Three iterations of Newton's method were needed
to achieve convergence.
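
The calculation can be sketched in Python as follows. The function name `vdw_volume` and its keyword arguments are hypothetical names introduced for this illustration; the formulas for $f$ and $f'$, the tolerance, the iteration limit and the ideal gas starting value are those described above.

```python
def vdw_volume(P, T, n, a, b, R=0.08206, eps=5e-7, max_iter=10):
    """Solve the van der Waals equation for V with Newton's method.

    Units: atm, K, mole, liter. Returns (V, number of iterations).
    The function name and argument names are illustrative assumptions.
    """
    def f(V):
        return (P + n * n * a / V**2) * (V - n * b) - n * R * T

    def fprime(V):
        return P + n * n * a / V**2 - (2.0 * n * n * a / V**3) * (V - n * b)

    V = n * R * T / P                      # ideal gas law starting value
    for k in range(1, max_iter + 1):
        V_new = V - f(V) / fprime(V)
        if abs(V_new - V) < eps:           # superlinear stopping condition
            return V_new, k
        V = V_new
    raise RuntimeError("no convergence within max_iter iterations")

# One mole of chlorine gas at 2 atm and 313 K.
V, iterations = vdw_volume(P=2.0, T=313.0, n=1.0, a=6.29, b=0.0562)
print(f"V = {V:.7f} liter after {iterations} iterations")
```

Starting from the ideal gas value works well here because the van der Waals corrections are small at this pressure and temperature.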


Application Problem 2: Location of Maximum in an Energy Distribution


The energy density $\psi$ within an isothermal blackbody enclosure is given by Planck's
radiation law

$$\psi = \frac{8\pi c h \lambda^{-5}}{e^{ch/\lambda kT} - 1},$$

where $\lambda$ is the wavelength of the radiation, $T$ is the absolute temperature of the
blackbody, $h$ is Planck's constant, $k$ is Boltzmann's constant and $c$ is the speed of
light. To determine the wavelength which maximizes the energy density, we first
calculate

$$\frac{d\psi}{d\lambda} = \frac{8\pi c h \lambda^{-6}}{e^{ch/\lambda kT} - 1} \left( \frac{(ch/\lambda kT)\, e^{ch/\lambda kT}}{e^{ch/\lambda kT} - 1} - 5 \right).$$

The term in front of the parentheses is zero in the limits as $\lambda \to 0$ and as $\lambda \to \infty$;
however, both of these situations give rise to minima in the energy density. The
maximum we are seeking arises when the term inside the parentheses is zero. This
happens when

$$1 - \frac{ch}{5\lambda_{\max} kT} = e^{-ch/\lambda_{\max} kT},$$

where $\lambda_{\max}$ is the wavelength that maximizes the energy density. If we let $x =
ch/\lambda_{\max} kT$, then the equation for the maximum becomes

$$1 - \frac{x}{5} = e^{-x}.$$

Let's define

$$f(x) = e^{-x} + \frac{x}{5} - 1$$

and calculate

$$f'(x) = -e^{-x} + \frac{1}{5}.$$
Note that $f$ has a zero at $x = 0$, but we know that we don't want that root. Since
the line $1 - \frac{x}{5}$ has an intercept at $x = 5$ and $e^{-5} \approx 6.74 \times 10^{-3}$, it is likely that $f$
has a zero near $x = 5$. Applying Newton's method with $p_0 = 5$ and $\epsilon = 5 \times 10^{-3}$,
two iterations produce the approximation $x \approx 4.965$. Therefore,

$$\lambda_{\max} \approx \frac{ch}{4.965\,kT}.$$


EXERCISES


1. Each of the following equations has a root on the interval $(0,1)$. Perform Newton's
method to determine $p_4$, the fourth approximation to the location of the
root.

(a) In(1+2) ~cosz =0 (b) 2° +22-1=0 
(c) e *-2z=0 (d) cosx~2 =0 


2. Construct an algorithm for Newton's method. Is it necessary to save all calculated
terms in the sequence $\{p_n\}$?


In Exercises 3-6, an equation, an interval on which the equation has a root, and the
exact value of the root are specified.

(1) Perform five (5) iterations of Newton's method.
(2) For $n \ge 1$, compare $|p_n - p_{n-1}|$ with $|p_{n-1} - p|$ and $|p_n - p|$.
(3) For $n \ge 1$, compute the ratio $|p_n - p|/|p_{n-1} - p|^2$ and show that this value
approaches $|f''(p)/2f'(p)|$.


3. The equation $x^3 + x^2 - 3x - 3 = 0$ has a root on the interval $(1,2)$, namely
$x = \sqrt{3}$.

4. The equation $x^7 = 3$ has a root on the interval $(1,2)$, namely $x = \sqrt[7]{3}$.

5. The equation $x^3 - 13 = 0$ has a root on the interval $(2,3)$, namely $x = \sqrt[3]{13}$.

6. The equation $1/x - 37 = 0$ has a zero on the interval $(0.01, 0.1)$, namely $x = 1/37$.

7. Show that when Newton's method is applied to the equation $x^2 - a = 0$, the
resulting iteration function is $g(x) = \frac{1}{2}\left( x + \frac{a}{x} \right)$.

8. Show that when Newton's method is applied to the equation $1/x - a = 0$, the
resulting iteration function is $g(x) = x(2 - ax)$.


9. The function $f(x) = \sin x$ has a zero on the interval $(3,4)$, namely $x = \pi$.
Perform three iterations of Newton's method to approximate this zero, using
$p_0 = 4$. Determine the absolute error in each of the computed approximations.
What is the apparent order of convergence? What explanation can you provide
for this behavior? (Note: If you have access to Maple, perform five iterations
with the Digits parameter set to at least 100.)

10. (a) Verify that the equation $x^4 - 18x^2 + 45 = 0$ has a root on the interval $(1, 2)$.
Next, perform three iterations of Newton's method, with $p_0 = 1$. Given that
the exact value of the root is $x = \sqrt{3}$, compute the absolute error in the
approximations just obtained. What is the apparent order of convergence?
What explanation can you provide for this behavior? (Note: If you have
access to Maple, perform five iterations with the Digits parameter set to
at least 100.)

(b) Verify that the equation $x^4 - 18x^2 + 45 = 0$ also has a root on the interval
$(3, 4)$. Perform five iterations of Newton's method, and compute the absolute
error in each approximation. The exact value of the root is $x = \sqrt{15}$. What
is the apparent order of convergence in this case?

(c) What explanation can you provide for the different convergence behavior
between parts (a) and (b)?

11. The function $f(x) = 27x^4 + 162x^3 - 180x^2 + 62x - 7$ has a zero at $x = 1/3$.
Perform ten iterations of Newton's method on this function, starting with $p_0 = 0$.
What is the apparent order of convergence of the sequence of approximations?
What is the multiplicity of the zero at $x = 1/3$? Would the sequence generated
by the bisection method converge faster?


12. Repeat Exercise 11 for the function

$$f(x) = \frac{x}{1 + x^2} - \frac{500}{841}\left( 1 - \frac{21x}{125} \right),$$

which has a zero at $x = 2.5$. Start Newton's method with $p_0 = 2$.

13. The function $f(x) = x^3 + 2x^2 - 3x - 1$ has a zero on the interval $(-1,0)$.
Approximate this zero to within an absolute tolerance of $5 \times 10^{-6}$.

14. For each of the functions given below, use Newton's method to approximate all
real roots. Use an absolute tolerance of $10^{-6}$ as a stopping condition.

(a) $f(x) = e^x + x^2 - x - 4$

(b) $f(x) = x^3 - x^2 - 10x + 7$

(c) $f(x) = 1.05 - 1.04x + \ln x$

15. An equation of state relates the volume $V$ occupied by one mole of a gas to the
instantaneous pressure $P$ and the Kelvin absolute temperature $T$ of the gas. The
Redlich-Kwong equation of state is given by

$$P = \frac{RT}{V - b} - \frac{a}{V(V + b)\sqrt{T}},$$

where $a$ and $b$ are related to the critical temperature $T_c$ and the critical pressure
$P_c$ by the equations

$$a = 0.42747\left( \frac{R^2 T_c^{5/2}}{P_c} \right) \quad \text{and} \quad b = 0.08664\left( \frac{R T_c}{P_c} \right).$$

The coefficient $R$ is a universal constant equal to 0.08206.

(a) Determine the volume of one mole of carbon dioxide at a temperature of
$T = 323.15$ K and a pressure of one atmosphere. For carbon dioxide, $T_c =
304.2$ K and $P_c = 72.9$ atmospheres.

(b) Determine the volume of one mole of ammonia at a temperature of $T =
450$ K and a pressure of 56 atmospheres. For ammonia, $T_c = 405.5$ K and
$P_c = 111.3$ atmospheres.


16. In determining the minimum cushion pressure needed to break a given thickness
of ice using an air cushion vehicle, Muller ("Ice Breaking with an Air Cushion
Vehicle," in Mathematical Modeling: Classroom Notes in Applied Mathematics,
M. S. Klamkin, editor, SIAM, 1987) derived the equation


h2 2p h2 3 
p(l-6*)+ (ono? = a) p+ pp- (Fr =0, 


where $p$ denotes the cushion pressure, $h$ the thickness of the ice field, $r$ the size
of the air cushion, $\sigma$ the tensile strength of the ice, and $\beta$ is related to the width
of the ice wedge. Take $\beta = 0.5$, $r = 40$ feet, and $\sigma = 150$ pounds per square inch
(psi). Determine $p$ for $h = 0.6$, 1.2, 1.8, 2.4, 3.0, 3.6, and 4.2 feet.
17. A frame structure is composed of two vertical columns and one horizontal beam,
as shown below. The vertical columns are of length $L$ and have modulus of
elasticity $E$ and moment of inertia $I$. The horizontal beam connecting the tops
of the columns is of length $L_1$ with modulus of elasticity $E$ and moment of
inertia $I_1$. The structure is pinned at the bottom and free to displace laterally
at the top. The buckling load, $P$, for the structure is given by

$$P = (kL)^2 \frac{EI}{L^2},$$

where $kL$ is the smallest positive solution of

$$kL \tan kL = 6\,\frac{I_1 L}{I L_1}.$$

Suppose $E = 30 \times 10^6$ lb/in², $I = 15.2$ in⁴, $L = 144$ in, $I_1 = 9.7$ in⁴ and
$L_1 = 120$ in. Determine the buckling load of the structure.

Figure 2.9

2.5 SECANT METHOD


Newton’s method is an extremely powerful rootfinding technique. With a “good” 
starting approximation, the sequence generated by Newton’s method converges 
very rapidly—quadratically, in fact. However, Newton’s method does require two 
new function evaluations per iteration, as well as knowledge of the derivative of 
the function whose zero is being approximated. In this section, we will develop a 
rootfinding technique known as the secant method, which addresses both of these 
negative aspects associated with Newton’s method. 


Secant Method 


The secant method can actually be viewed as a variation on either the method of
false position or Newton’s method. Like the method of false position, the secant
method computes the next approximation, $p_{n+1}$, as the $x$-intercept of a line that
passes through two points on the graph of $f$. The distinguishing features of the
secant method are as follows: First, no attempt is made to maintain an interval
that contains the root; and, second, the line from which $p_{n+1}$ is calculated is passed
through the points associated with the current and previous approximations, $p_n$ and
$p_{n-1}$ (see Figure 2.10). The equation of this line is

$$y - f(p_n) = \frac{f(p_n) - f(p_{n-1})}{p_n - p_{n-1}}\,(x - p_n),$$

so $p_{n+1}$ is given by

$$p_{n+1} = p_n - f(p_n)\,\frac{p_n - p_{n-1}}{f(p_n) - f(p_{n-1})}.$$

Definition. The SECANT METHOD is the rootfinding scheme based on the
recurrence relation

$$p_{n+1} = p_n - f(p_n)\,\frac{p_n - p_{n-1}}{f(p_n) - f(p_{n-1})}. \qquad (1)$$

From this definition, it is clear that the secant method does not require the
derivative of $f$. Recognizing the similarity between the formula for $p_{n+1}$ given
in (1) and the formula for $p_{n+1}$ from the method of false position, it should also be
clear that a properly constructed secant method algorithm will use only one new
function evaluation per iteration.
It is worth noting that equation (1) can also be derived by approximating the
derivative term in Newton’s method by

$$f'(p_n) \approx \frac{f(p_n) - f(p_{n-1})}{p_n - p_{n-1}};$$

in other words, the slope of the tangent line at $x = p_n$ is replaced by the slope of
the secant line formed between $x = p_n$ and $x = p_{n-1}$. Since the calculation of $p_{n+1}$
requires both $p_n$ and $p_{n-1}$, the secant method needs two starting values, $p_0$ and $p_1$,
to initiate the iteration. One obvious choice for starting values is the endpoints of
an interval that contains the root.


Figure 2.10 The secant method for approximating the zero of a function:
schematic for a single iteration.
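
Recurrence (1) can be sketched in Python as follows. This is only an illustration
under stated assumptions: the function name `secant`, the stopping rule
$|p_n - p_{n-1}| < \text{tol}$, and the iteration cap are choices made here, not something
the text prescribes. The stored value of $f(p_{n-1})$ is reused, so each iteration
costs one new function evaluation.

```python
def secant(f, p0, p1, tol=5e-7, max_iter=50):
    """Approximate a zero of f using the secant recurrence (1).

    Assumed stopping rule: |p_n - p_{n-1}| < tol.
    Returns (approximation, number of iterations performed).
    """
    f_prev = f(p0)                      # f(p_{n-1}), evaluated once up front
    p_prev, p_curr = p0, p1
    for n in range(1, max_iter + 1):
        f_curr = f(p_curr)              # the only new evaluation this iteration
        denom = f_curr - f_prev
        if denom == 0.0:
            raise ZeroDivisionError("secant line is horizontal")
        p_next = p_curr - f_curr * (p_curr - p_prev) / denom
        if abs(p_next - p_curr) < tol:
            return p_next, n
        p_prev, f_prev = p_curr, f_curr
        p_curr = p_next
    raise RuntimeError("secant method did not converge")


# Usage: the cubic with a zero on (1, 2), starting from p0 = 2 and p1 = 1.
root, iterations = secant(lambda x: x**3 + 2*x**2 - 3*x - 1, 2.0, 1.0, tol=1e-10)
print(root, iterations)
```

Nothing in this sketch keeps the iterates inside a bracketing interval, which
matches the first distinguishing feature of the secant method noted above.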


EXAMPLE 2.11 The Secant Method in Action 


Consider the function $f(x) = x^3 + 2x^2 - 3x - 1$, which we know has a unique zero
on the interval $(1, 2)$. Taking $p_0 = 2$ and $p_1 = 1$, the secant method produces

$$p_2 = p_1 - f(p_1)\frac{p_1 - p_0}{f(p_1) - f(p_0)} = 1 - (-1)\frac{1 - 2}{-1 - 9} = 1.1.$$

With $p_1 = 1$ and $p_2 = 1.1$, the next secant method approximation is

$$p_3 = p_2 - f(p_2)\frac{p_2 - p_1}{f(p_2) - f(p_1)}
      = 1.1 - (-0.549)\frac{1.1 - 1}{-0.549 - (-1)} = 1.2217294900.$$

The next four iterations produce

$$p_4 = p_3 - f(p_3)\frac{p_3 - p_2}{f(p_3) - f(p_2)} = 1.1964853266;$$

$$p_5 = p_4 - f(p_4)\frac{p_4 - p_3}{f(p_4) - f(p_3)} = 1.1986453684;$$

$$p_6 = p_5 - f(p_5)\frac{p_5 - p_4}{f(p_5) - f(p_4)} = 1.1986913364; \quad\text{and}$$

$$p_7 = p_6 - f(p_6)\frac{p_6 - p_5}{f(p_6) - f(p_5)} = 1.1986912435.$$

The approximation $p_7$ is correct to the digits shown and has an absolute error of
roughly $3.907 \times 10^{-12}$.


To obtain the approximation $p_7$, the secant method went through six iterations
(the first iteration produced $p_2$, the second $p_3$, etc.) and evaluated the function
$f$ at $x = p_0, p_1, p_2, p_3, p_4, p_5$, and $p_6$. So an absolute error of roughly
$3.907 \times 10^{-12}$ was achieved for a total cost of seven function evaluations. For
comparison, Newton’s method achieved similar accuracy in only four iterations; however,
those four iterations were performed at a cost of eight function evaluations. Thus, even
though the secant method took more iterations than Newton’s method to achieve
a given level of accuracy, the secant method used fewer function evaluations.

Does this phenomenon occur for other rootfinding problems? Consider the
function $f(x) = \tan(\pi x) - x - 6$. We know that with $p_0 = 0.48$, Newton’s method
produces an absolute error of roughly $2.448 \times 10^{-9}$ with five iterations and,
therefore, ten function evaluations. Starting from $p_0 = 0.4$ and $p_1 = 0.48$, the secant
method reaches similar accuracy (an absolute error of $2.022 \times 10^{-9}$) in eight
iterations, but this amounts to only nine function evaluations. Thus, once again, the
secant method requires more iterations but fewer function evaluations than
Newton’s method to achieve a given level of accuracy. An examination of our third
standard test problem, locating the zero of $f(x) = x^3 + 2x^2 - 3x - 1$ on $(-3, -2)$,
is left for the exercises.

The influence of the starting approximations $p_0$ and $p_1$ on the performance of
the secant method will also be explored in the exercises.
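
The function-evaluation counts quoted above can be checked with a small
experiment. The sketch below is illustrative only: the wrapper class
`CountingFunction` and the helper `secant_iterations` are hypothetical names
introduced here, and the helper performs a fixed number of iterations rather
than testing for convergence.

```python
import math


class CountingFunction:
    """Wrap a function and count how many times it is evaluated."""

    def __init__(self, func):
        self.func = func
        self.calls = 0

    def __call__(self, x):
        self.calls += 1
        return self.func(x)


def secant_iterations(f, p0, p1, n_iter):
    """Perform exactly n_iter secant iterations; return the last approximation."""
    f_prev = f(p0)
    p_prev, p_curr = p0, p1
    for _ in range(n_iter):
        f_curr = f(p_curr)              # one new evaluation per iteration
        p_next = p_curr - f_curr * (p_curr - p_prev) / (f_curr - f_prev)
        p_prev, f_prev = p_curr, f_curr
        p_curr = p_next
    return p_curr


# Six iterations on the cubic: p_7 at a cost of seven evaluations.
cubic = CountingFunction(lambda x: x**3 + 2*x**2 - 3*x - 1)
p7 = secant_iterations(cubic, 2.0, 1.0, 6)
print(p7, cubic.calls)

# Eight iterations on tan(pi x) - x - 6: p_9 at a cost of nine evaluations.
tangent = CountingFunction(lambda x: math.tan(math.pi * x) - x - 6)
p9 = secant_iterations(tangent, 0.4, 0.48, 8)
print(p9, tangent.calls)
```

The count is always one more than the number of iterations because $f(p_0)$ is
evaluated before the loop and the newest approximation is never evaluated.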


Order of Convergence 


To determine the order of convergence for the secant method, we need to derive the
corresponding error evolution equation. The first step is to subtract the true root,
$p$, from both sides of the recurrence formula for $p_{n+1}$, yielding

$$p_{n+1} - p = p_n - p - f(p_n)\frac{p_n - p_{n-1}}{f(p_n) - f(p_{n-1})}.$$

The remaining steps are nearly identical to those used to derive the error evolution
equation for the method of false position. The details are therefore left as an
exercise. The end result is

$$p_{n+1} - p \approx (p_n - p)(p_{n-1} - p)\,
  \frac{f''(p)}{2f'(p) + f''(p)(p_n + p_{n-1} - 2p)}.$$

As $p_n$ and $p_{n-1}$ approach $p$, the term in the denominator involving the second
derivative can be dropped and the leading term in the error is given by

$$|e_{n+1}| \approx C|e_n||e_{n-1}|, \quad\text{where } C = f''(p)/2f'(p). \qquad (2)$$

Now, suppose that the secant method is of order $\alpha$ with asymptotic error
constant $\lambda$; that is, successive errors are related by the asymptotic formula
$|e_{n+1}| \approx \lambda|e_n|^{\alpha}$. This relationship can also be written as
$|e_n| \approx \lambda|e_{n-1}|^{\alpha}$, which, when solved for $|e_{n-1}|$, yields
$|e_{n-1}| \approx \lambda^{-1/\alpha}|e_n|^{1/\alpha}$. Substituting for $|e_{n+1}|$ and
$|e_{n-1}|$ in (2) leads to

$$\lambda|e_n|^{\alpha} \approx C|e_n|\,\lambda^{-1/\alpha}|e_n|^{1/\alpha}. \qquad (3)$$

Equating powers on $|e_n|$ in (3), it follows that $\alpha$ must satisfy the algebraic
equation $\alpha = 1 + 1/\alpha$. The single positive root of this equation is
$\alpha = (1 + \sqrt{5})/2$. Hence, the secant method is of order
$(1 + \sqrt{5})/2 \approx 1.618$. Furthermore, equating the coefficients of $|e_n|$ yields
$\lambda = C\lambda^{-1/\alpha}$, so that

$$\lambda = C^{1/\alpha} = \left(\frac{f''(p)}{2f'(p)}\right)^{\alpha - 1},$$

since $1/\alpha = \alpha - 1 \approx 0.618$.

EXAMPLE 2.12 Demonstration of Order of Convergence and Asymptotic 
Error Constant 


Consider the secant method applied to the function $f(x) = x^3 + 2x^2 - 3x - 1$ with
starting approximations of $p_0 = 2$ and $p_1 = 1$. The absolute error in $p_0$, $p_1$, and the
first six secant method approximations is listed in the following table. Also listed
in the table is the ratio $|e_n|/|e_{n-1}|^{1.618}$.


n    Absolute Error, |e_n|     |e_n| / |e_{n-1}|^{1.618}
0    8.0130876 × 10^{-1}
1    1.9869124 × 10^{-1}       0.28434
2    9.8691244 × 10^{-2}       1.34842
3    2.3038247 × 10^{-2}       0.97658
4    2.2059169 × 10^{-3}       0.98434
5    4.5875100 × 10^{-5}       0.91128
6    9.2909772 × 10^{-8}       0.97192
7    3.9070969 × 10^{-12}      0.93225


Note that the ratio $|e_n|/|e_{n-1}|^{1.618}$ approaches a constant, thereby providing
numerical confirmation that the order of convergence of the sequence is $\alpha = 1.618$.
Further, the error ratio appears to be approaching

$$\left(\frac{f''(p)}{2f'(p)}\right)^{0.618} \approx 0.94759,$$

providing numerical confirmation that the asymptotic error constant for the secant
method is $\lambda = (f''(p)/2f'(p))^{0.618}$.
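
The ratios in the table can be reproduced numerically. The sketch below is an
illustration under stated assumptions: the "true" root is taken to be the result
of running Newton’s method well past convergence (a choice made here so that
errors near $10^{-12}$ can be measured in double precision), and the function
values are recomputed for clarity rather than reused.

```python
ALPHA = 1.618  # order of convergence, rounded as in the table


def f(x):
    return x**3 + 2*x**2 - 3*x - 1


def df(x):
    return 3*x**2 + 4*x - 3


def d2f(x):
    return 6*x + 4


# Reference root (assumption: Newton's method iterated to machine precision).
p_true = 1.2
for _ in range(20):
    p_true -= f(p_true) / df(p_true)

# Secant iterates p_0, ..., p_7 from p_0 = 2 and p_1 = 1.
p = [2.0, 1.0]
for _ in range(6):
    slope = (f(p[-1]) - f(p[-2])) / (p[-1] - p[-2])
    p.append(p[-1] - f(p[-1]) / slope)

errors = [abs(value - p_true) for value in p]
ratios = [errors[n] / errors[n - 1] ** ALPHA for n in range(1, len(errors))]
predicted = abs(d2f(p_true) / (2 * df(p_true))) ** 0.618

for n, ratio in enumerate(ratios, start=1):
    print(n, f"{errors[n]:.7e}", f"{ratio:.5f}")
print("predicted limit:", f"{predicted:.5f}")
```

The last ratio is the most sensitive to rounding, because the error in $p_7$ is
only a few thousand times larger than machine precision.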


The analysis we’ve just completed is based on the assumption that $f'(p) \ne 0$;
that is, $p$ is a simple zero of $f$. We saw in the previous section that the order
of convergence of Newton’s method drops to linear when approximating a zero of
multiplicity greater than one. In the exercises we will explore whether the same
fate befalls the secant method.
Application Problem: Solving a Crime


In the Chapter 1 Overview (see the problem capsule “Solving a Crime” on page 2), 
we were presented with the problem of determining time of death from core temper- 
ature measurements. To summarize, Commissioner Gordon had been found dead 
in his office. At 8:00 PM, the county coroner determined the core temperature of 
the corpse to be 90°F. One hour later, the core temperature had dropped to 85°F. 
Captain Furillo believed that the infamous Doc B had killed the commissioner. 
Doc B, however, claimed to have an alibi. Lois Lane was interviewing him at the 
Daily Planet Building, just across the street from the commissioner’s office. The re- 
ceptionist at the Daily Planet Building checked Doc B into the building at 6:35PM, 
and the interview tapes confirmed that Doc B was occupied from 6:40 PM until 
7:15 PM. 

To determine time of death, we used Newton’s Law of Cooling to model the 
temperature of the corpse as a function of time. Taking t = 0 to correspond to 
8:00PM, the time when the first core temperature was taken, we were able to reduce 
the problem to two nonlinear algebraic equations. We will now solve those equations 
with the help of the secant method. 

The first equation that we have to solve is

$$73 - \frac{1}{k} + \left(18 + \frac{1}{k}\right)e^{-k} = 85,$$

where $k$ is the constant of proportionality from Newton’s Law of Cooling and
measures the rate at which the corpse loses heat to its surroundings. To prepare for
using the secant method, we rewrite the above equation as the function

$$f(k) = -12 - \frac{1}{k} + \left(18 + \frac{1}{k}\right)e^{-k}.$$

With $p_0 = 0.1$ and $p_1 = 1$, six iterations of the secant method produce the value
$k = 0.337114$. Iterations were terminated when $|p_n - p_{n-1}|$ fell below
$\epsilon = 5 \times 10^{-7}$.
(Why is this an appropriate stopping condition for use with the secant method?)

With the value of $k$ determined, we can now turn to solving the second equation,

$$72 + t_d - \frac{1}{k} + \left(18 + \frac{1}{k}\right)e^{-kt_d} = 98.6,$$

where $t_d$ is the time of death, measured in hours, and 98.6 is the assumed
temperature of the corpse at the time of death. To prepare for using the secant method
here, we substitute the value of $k$ and rearrange the equation as

$$f(t_d) = -26.6 + t_d - \frac{1}{0.337114}
         + \left(18 + \frac{1}{0.337114}\right)e^{-0.337114\,t_d}.$$

With $p_0 = -2$, $p_1 = 0$ and the same stopping condition noted above, six iterations
yield $t_d = -1.130939$. Thus, the time of death was roughly 1 hour and 8 minutes
prior to 8:00 PM, or 6:52 PM. This is right in the middle of Doc B’s interview with
Lois Lane, so Doc B could not have killed the commissioner.
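
The two secant solves can be sketched as follows. This is an illustration
only: the helper `secant` and the residual function names are hypothetical, and,
unlike the text, the second solve uses the computed value of $k$ rather than the
rounded value 0.337114, which changes $t_d$ only far beyond the digits quoted.

```python
import math


def secant(f, p0, p1, tol=5e-7, max_iter=50):
    """Secant iteration stopped when |p_n - p_{n-1}| < tol."""
    f_prev = f(p0)
    p_prev, p_curr = p0, p1
    for _ in range(max_iter):
        f_curr = f(p_curr)
        p_next = p_curr - f_curr * (p_curr - p_prev) / (f_curr - f_prev)
        if abs(p_next - p_curr) < tol:
            return p_next
        p_prev, f_prev = p_curr, f_curr
        p_curr = p_next
    raise RuntimeError("secant method did not converge")


def cooling_residual(k):
    """f(k): temperature model at t = 1 hour minus the measured 85 degrees F."""
    return -12.0 - 1.0 / k + (18.0 + 1.0 / k) * math.exp(-k)


k = secant(cooling_residual, 0.1, 1.0)


def death_time_residual(t_d):
    """f(t_d): temperature model at time t_d minus the assumed 98.6 degrees F."""
    return -26.6 + t_d - 1.0 / k + (18.0 + 1.0 / k) * math.exp(-k * t_d)


t_d = secant(death_time_residual, -2.0, 0.0)
minutes_before_8pm = -60.0 * t_d
print(k, t_d, minutes_before_8pm)
```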
EXERCISES
1. Each of the following equations has a root on the interval (0,1). Perform the
secant method to determine $p_4$, the fourth approximation to the location of the
root.

(a) n(i+z)—cosz=0 (b) 2° +2r-1=0 

(c) 7 7 ~2=0 (d) cosr-—z=0 


2. Construct an algorithm for the secant method.
3. Show that the equation for the secant method can be rewritten as

$$p_{n+1} = \frac{f(p_n)p_{n-1} - f(p_{n-1})p_n}{f(p_n) - f(p_{n-1})}.$$

Explain why this formula is inferior to the one used in the text.


4. Fill in the missing details in the derivation of the error evolution equation

$$p_{n+1} - p \approx (p_n - p)(p_{n-1} - p)\,
  \frac{f''(p)}{2f'(p) + f''(p)(p_n + p_{n-1} - 2p)}.$$


In Exercises 5–8, an equation, an interval on which the equation has a root, and the
exact value of the root are specified.

(a) Perform seven (7) iterations of the secant method.
(b) For $n > 2$, compare $|p_n - p_{n-1}|$ with $|p_{n-1} - p|$ and $|p_n - p|$.
(c) For $n > 2$, compute the ratio $|p_n - p|/|p_{n-1} - p|^{1.618}$ and show that this
value approaches $\left(|f''(p)/2f'(p)|\right)^{0.618}$.


5. The equation $x^3 + x^2 - 3x - 3 = 0$ has a root on the interval $(1,2)$, namely
$x = \sqrt{3}$.


6. The equation $x^7 = 3$ has a root on the interval $(1,2)$, namely $x = \sqrt[7]{3}$.


7. The equation $x^3 - 13 = 0$ has a root on the interval $(2,3)$, namely $x = \sqrt[3]{13}$.
8. The equation $1/x - 37 = 0$ has a zero on the interval $(0.01, 0.1)$, namely $x = 1/37$.
9. The function $f(x) = \sin x$ has a zero on the interval $(3,4)$, namely $x = \pi$.
Perform five iterations of the secant method to approximate this zero, using
$p_0 = 3$ and $p_1 = 4$. Determine the absolute error in each of the computed
approximations. What is the apparent order of convergence? What explanation
can you provide for this behavior? (Note: If you have access to Maple, perform
seven iterations with the Digits parameter set to at least 100.)


10. (a) Verify that the equation $x^4 - 18x^2 + 45 = 0$ has a root on the interval
(1, 2). Next, perform five iterations of the secant method, using $p_0 = 1$ and
$p_1 = 2$. Given that the exact value of the root is $x = \sqrt{3}$, compute the
absolute error in the approximations just obtained. What is the apparent
order of convergence? What explanation can you provide for this behavior?
(Note: If you have access to Maple, perform seven iterations with the Digits
parameter set to at least 100.)

(b) Verify that the equation $x^4 - 18x^2 + 45 = 0$ also has a root on the interval
(3,4). Perform seven iterations of the secant method, and compute the
absolute error in each approximation. The exact value of the root is $\sqrt{15}$.
What is the apparent order of convergence in this case? What explanation
can you provide for the different convergence behavior between parts (a)
and (b)?
11. It was observed that Newton’s method provides only linear convergence towards
roots of multiplicity greater than one. How does the secant method perform
under such circumstances? Each of the following functions has a zero at the
specified location. Perform ten iterations of the secant method to locate these
zeros. Does the sequence generated by the secant method converge with order
$\alpha \approx 1.618$ or has the order dropped to $\alpha = 1$?
(a) $f(x) = x(1 - \cos x)$ has a zero at $x = 0$; use $p_0 = -1$ and $p_1 = 2$
(b) $f(x) = 27x^4 + 162x^3 - 180x^2 + 62x - 7$ has a zero at $x = 1/3$
(c) f(z) = ye — Hp (1 — 242) has a zero at 2 = 2.5 


12. Newton’s method approximates the zero of $f(x) = x^3 + 2x^2 - 3x - 1$ on the interval
$(-3, -2)$ to within $9.436 \times 10^{-11}$ in 3 iterations and 6 function evaluations. How
many iterations and how many function evaluations are needed by the secant
method to approximate this zero to a similar accuracy? Take $p_0 = -2$ and
$p_1 = -3$.


In Exercises 13–16 we will investigate the influence of the starting approximations $p_0$
and $p_1$ on the performance of the secant method. In each exercise, apply the secant
method to the indicated function using the indicated values for $p_0$ and $p_1$. Iterate until
$|p_n - p_{n-1}| < 5 \times 10^{-7}$. Record and compare the final approximation and the number
of iterations in each case.


13. $f(x) = x^3 + 2x^2 - 3x - 1$
(a) $p_0 = -3$, $p_1 = -2$    (b) $p_0 = -2$, $p_1 = -3$
(c) $p_0 = -4$, $p_1 = -2$    (d) $p_0 = -2$, $p_1 = -4$

14. $f(x) = x^3 + 2x^2 - 3x - 1$
(a) $p_0 = 1$, $p_1 = 2$    (b) $p_0 = 2$, $p_1 = 1$
(c) $p_0 = 3$, $p_1 = 2$    (d) $p_0 = 2$, $p_1 = 3$

15. $f(x) = \tan(\pi x) - x - 6$
(a) $p_0 = 0$, $p_1 = 0.48$    (b) $p_0 = 0.24$, $p_1 = 0.48$
(c) $p_0 = 0.4$, $p_1 = 0.48$

16. $f(x) = x^3 - 2x - 5$
(a) $p_0 = 1$, $p_1 = 3$    (b) $p_0 = 1$, $p_1 = 2$
(c) $p_0 = 3$, $p_1 = 2$

17. The function $f(x) = x^3 + 2x^2 - 3x - 1$ has a simple zero on the interval $(-1, 0)$.
Approximate this zero to within an absolute tolerance of 5 x 107°. 


18. For each of the functions given below, use the secant method to approximate all
real ane Use an absolute tolerance of 107° as a stopping condition. 

(a) (z) = e* + v -a-4 

(b) Head —2?-l0r+7 

(ec) f(r) = 1.05 - 1.042 + nz 
19. Keller (“Probability of a Shutout in Racquetball,” SIAM Review, 26, 267–268,
1984) showed that the probability that Player A will shut out Player B in a game
of racquetball is given by

$$P = \frac{1+w}{2}\left(\frac{w}{1 - w + w^2}\right)^{21},$$

where $w$ denotes the probability that Player A will win any specific rally,
independent of the server. Determine the minimal value of $w$ that will guarantee
that Player A will shut out Player B in at least one-quarter of the games they
play. Repeat your calculations for at least half the games being shutouts and at
least three-quarters of the games being shutouts.

20. A couple wishes to open a money market account in which they will save the 
down payment for purchasing a house. The couple has $13,000 from the sale of 
some stock with which to open the account and plans to deposit an additional 
$200 each month thereafter. By the end of three years, the couple hopes to have 
saved $20,000. If the money market account pays an annual interest of R%, 
compounded monthly, then at the end of three years, the balance of the account 
will be

36 1+5)"-1 
18000 (1 + a) +299 (tt a) = 2 

12 B 
What is the lowest interest rate which will achieve the couple’s goal of saving
$20,000? What is the lowest interest rate if the couple can raise their monthly
deposit to $250?

21. Suppose it was discovered that Commissioner Gordon had the flu when he died,
and his core temperature at the time of his death was 103°F. With $k = 0.337114$,
solve the equation

$$72 + t_d - \frac{1}{k} + \left(18 + \frac{1}{k}\right)e^{-kt_d} = 103$$

to determine the time of death based on this new information. Does Doc B’s
alibi still hold?


2.6 ACCELERATING CONVERGENCE 


Having spent so much time discussing speed of convergence, a natural question 
to ask would be whether it is possible to accelerate the convergence speed of a se- 
quence. For example, can anything be done to speed up the convergence of a linearly 
convergent sequence? Also, can anything be done to restore quadratic convergence 
to Newton’s method when attempting to approximate a root of multiplicity greater 
than one? These questions will be addressed in this section. 


Aitken’s $\Delta^2$-Method


Let’s start by accelerating the convergence of a linearly convergent sequence. Thus
far, the only truly linearly convergent sequences we’ve encountered have been
generated by either the method of false position or fixed point iteration. Remember
that we had to stretch the definition of linear convergence to make the bisection
method fit.

During our development of the method of false position, we found that the
error associated with the $n$-th term in the false position sequence, $p_n$, can be
estimated by the formula

$$p - p_n \approx \frac{\lambda}{1 - \lambda}\,(p_n - p_{n-1}). \qquad (1)$$

Here, $p$ denotes the limit value of the sequence, and

$$\lambda \approx \frac{p_n - p_{n-1}}{p_{n-1} - p_{n-2}}. \qquad (2)$$

Similarly, we found that the error associated with the $n$-th term in a sequence
generated by fixed point iteration can be estimated by the formula

$$p - p_n \approx \frac{g'(p)}{1 - g'(p)}\,(p_n - p_{n-1}), \qquad (3)$$

where $g$ is the iteration function and

$$g'(p) \approx \frac{p_n - p_{n-1}}{p_{n-1} - p_{n-2}}. \qquad (4)$$

Substituting (2) into (1) or (4) into (3) and solving the resulting expression for $p$
yields

$$p \approx p_n - \frac{(p_n - p_{n-1})^2}{p_n + p_{n-2} - 2p_{n-1}}. \qquad (5)$$

Given the approximate nature of (1), (2), (3), and (4), the value given by (5) is
not likely to be the exact limit of the sequence, but it should, at least, be a better
approximation to that limit than $p_n$. This is the fundamental idea behind what is
known as Aitken’s $\Delta^2$-method.
From a linearly convergent sequence $\{p_n\}$, Aitken’s $\Delta^2$-method constructs the
sequence $\{\hat{p}_n\}$ according to the rule

$$\hat{p}_n = p_n - \frac{(p_n - p_{n-1})^2}{p_n + p_{n-2} - 2p_{n-1}}.$$

The formula for $\hat{p}_n$ is usually written in a more compact notation, from which the
method derives its name. Let $\Delta$ denote the differencing operator that is defined by
the relation $\Delta p_n = p_n - p_{n-1}$. The numerator of the second term on the right-hand
side of the formula for $\hat{p}_n$ can then be expressed as $(\Delta p_n)^2$. As for the
denominator, note that

$$p_n - 2p_{n-1} + p_{n-2} = (p_n - p_{n-1}) - (p_{n-1} - p_{n-2})
  = \Delta p_n - \Delta p_{n-1} = \Delta(\Delta p_n) = \Delta^2 p_n.$$

The formula for $\hat{p}_n$ can therefore be written as

$$\hat{p}_n = p_n - \frac{(\Delta p_n)^2}{\Delta^2 p_n}.$$

The sequence $\{\hat{p}_n\}$ is guaranteed to converge more rapidly than the sequence $\{p_n\}$
in the sense that

$$\lim_{n\to\infty} \frac{|\hat{p}_n - p|}{|p_n - p|} = 0.$$

A proof of this statement will be developed in Exercise 16.


EXAMPLE 2.13 Accelerating the Method of False Position 


In Section 2.2, the method of false position was used to approximate the zero of
$f(x) = x^3 + 2x^2 - 3x - 1$ on the interval $(1,2)$. Here, we will accelerate the
convergence of the false position sequence by applying Aitken’s $\Delta^2$-method.

Note that the formula for $\hat{p}_n$ requires three consecutive terms from the $p_n$
sequence. We therefore start by using the method of false position to calculate
$p_1 = 1.1$, $p_2 = 1.1517436381$ and $p_3 = 1.1768409100$. Substituting these values into
the $\hat{p}_n$ formula for $n = 3$, we find

$$\hat{p}_3 = p_3 - \frac{(p_3 - p_2)^2}{p_3 + p_1 - 2p_2}
  = 1.1768409100 - \frac{(1.1768409100 - 1.1517436381)^2}{1.1768409100 + 1.1 - 2(1.1517436381)}
  = 1.2004791447.$$

We now return to false position and calculate $p_4 = 1.1886276733$. Substituting the
values for $p_2$, $p_3$ and $p_4$ into the $\hat{p}_n$ formula with $n = 4$ then yields

$$\hat{p}_4 = 1.1886276733 - \frac{(1.1886276733 - 1.1768409100)^2}{1.1886276733 + 1.1517436381 - 2(1.1768409100)}
  = 1.1990651249.$$

Continuing in this fashion, alternating between the method of false position and
Aitken’s $\Delta^2$-method, we obtain the values listed below. Recall that to ten decimal
places $p = 1.1986912435$.

n    False Position, p_n    Aitken’s Δ², p̂_n
1    1.1000000000
2    1.1517436381
3    1.1768409100           1.2004791447
4    1.1886276733           1.1990651249
5    1.1940789113           1.1987692873
6    1.1965820882           1.1987075172
7    1.1977277544           1.1986946351
8    1.1982513178           1.1986919502
Clearly the sequence generated by Aitken’s $\Delta^2$-method is converging faster:
$p_8$ is accurate to three decimal places, while $\hat{p}_8$ is accurate to six decimal places.
The next table demonstrates that both sequences are converging linearly, but the
asymptotic error constant for the $\hat{p}_n$ sequence is less than half the asymptotic error
constant for the $p_n$ sequence. Thus, Aitken’s $\Delta^2$-method accelerates convergence
not by increasing the order of convergence, $\alpha$, but by reducing the asymptotic error
constant, $\lambda$.


         False Position                           Aitken’s Δ²
n    Absolute Error, |e_n|   |e_n|/|e_{n-1}|   Absolute Error, |e_n|   |e_n|/|e_{n-1}|
1    9.8691244 × 10^{-2}
2    4.6947605 × 10^{-2}     0.4757
3    2.1850334 × 10^{-2}     0.4654            1.7879012 × 10^{-3}
4    1.0063570 × 10^{-2}     0.4606            3.7388139 × 10^{-4}     0.2091
5    4.6123322 × 10^{-3}     0.4583            7.8048755 × 10^{-5}     0.2087
6    2.1091553 × 10^{-3}     0.4573            1.6273731 × 10^{-5}     0.2085
7    9.6348913 × 10^{-4}     0.4568            3.3916255 × 10^{-6}     0.2084
8    4.3992572 × 10^{-4}     0.4566            7.0667577 × 10^{-7}     0.2084
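
The acceleration step in this example can be sketched as a short Python
function. The name `aitken` and the list-based interface are assumptions made
for illustration; the input values are the first four false position iterates
quoted in the example, so the output is limited by their ten-decimal rounding.

```python
def aitken(p):
    """Apply Aitken's Delta^2 formula to a list of iterates.

    Entry k of the result is built from p[k], p[k+1], p[k+2], so it
    corresponds to the accelerated value paired with p[k+2].
    """
    accelerated = []
    for n in range(2, len(p)):
        delta = p[n] - p[n - 1]                    # Delta p_n
        delta2 = p[n] - 2.0 * p[n - 1] + p[n - 2]  # Delta^2 p_n
        accelerated.append(p[n] - delta**2 / delta2)
    return accelerated


# False position iterates p_1, ..., p_4 from the example.
false_position = [1.1, 1.1517436381, 1.1768409100, 1.1886276733]
p_hat = aitken(false_position)   # [p_hat_3, p_hat_4]
print(p_hat)
```

This form subtracts a small correction from $p_n$, which is the formula used in
the text.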


EXAMPLE 2.14 Accelerating Fixed Point Iteration 


Next, let’s accelerate the convergence of fixed point iteration when applied to
approximate the fixed point of the function $g(x) = e^{-x}$. Here, we proceed exactly as
we did above. With $p_0 = 0$, we first use fixed point iteration to calculate $p_1 = 1$,
$p_2 = 0.3678794412$, and $p_3 = 0.6922006276$. Then, substituting these values into
the $\hat{p}_n$ formula for $n = 3$, we find

$$\hat{p}_3 = p_3 - \frac{(p_3 - p_2)^2}{p_3 + p_1 - 2p_2}
  = 0.6922006276 - \frac{(0.6922006276 - 0.3678794412)^2}{0.6922006276 + 1 - 2(0.3678794412)}
  = 0.5822260970.$$

We continue by calculating one term in the $p_n$ sequence followed by one term in
the $\hat{p}_n$ sequence. After ten iterations of fixed point iteration, we have the values
listed below. To ten decimal places, the fixed point of $g$ is $x = 0.5671432904$.


n     Fixed Point, p_n    Aitken’s Δ², p̂_n
1     1.0000000000
2     0.3678794412
3     0.6922006276        0.5822260970
4     0.5004735006        0.5717057675
5     0.6062435351        0.5686388059
6     0.5453957860        0.5676169948
7     0.5796123355        0.5672967525
8     0.5601154614        0.5671924279
9     0.5711431151        0.5671591338
10    0.5648793474        0.5671483792
Again, it is clear that the sequence generated by Aitken’s $\Delta^2$-method is
converging faster: $p_{10}$ is accurate to only two decimal places, while $\hat{p}_{10}$ is accurate
to five decimal places. The next table shows that both sequences are converging
linearly, but the asymptotic error constant for the $\hat{p}_n$ sequence is less than 60% of
the asymptotic error constant for the $p_n$ sequence.


          Fixed Point                              Aitken’s Δ²
n     Absolute Error, |e_n|   |e_n|/|e_{n-1}|   Absolute Error, |e_n|   |e_n|/|e_{n-1}|
1     0.4328567096
2     0.1992638492            0.4603
3     1.2505734 × 10^{-1}     0.6276            1.5082807 × 10^{-2}
4     6.6669790 × 10^{-2}     0.5331            4.5624771 × 10^{-3}     0.3025
5     3.9100245 × 10^{-2}     0.5865            1.4955155 × 10^{-3}     0.3278
6     2.1747504 × 10^{-2}     0.5562            4.7370444 × 10^{-4}     0.3167
7     1.2469045 × 10^{-2}     0.5734            1.5346208 × 10^{-4}     0.3240
8     7.0278290 × 10^{-3}     0.5636            4.9137477 × 10^{-5}     0.3202
9     3.9998247 × 10^{-3}     0.5691            1.5843424 × 10^{-5}     0.3224
10    2.2639430 × 10^{-3}     0.5660            5.0888172 × 10^{-6}     0.3212
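
The two error-ratio columns can be regenerated with a few lines of Python. This
is a sketch under stated assumptions: the fixed point is taken as the
ten-decimal value 0.5671432904 quoted in the example, and the dictionary
`p_hat` keyed by $n$ is simply a convenient way to mirror the table’s indexing.

```python
import math

FIXED_POINT = 0.5671432904  # ten-decimal value quoted in the example


def g(x):
    return math.exp(-x)


# Fixed point iterates p_0, ..., p_10 starting from p_0 = 0.
p = [0.0]
for _ in range(10):
    p.append(g(p[-1]))

# Aitken values for n = 3, ..., 10 (each uses p_{n-2}, p_{n-1}, p_n).
p_hat = {
    n: p[n] - (p[n] - p[n - 1]) ** 2 / (p[n] - 2.0 * p[n - 1] + p[n - 2])
    for n in range(3, 11)
}

# Successive error ratios at n = 10 estimate the asymptotic error constants.
ratio_fixed_point = abs(p[10] - FIXED_POINT) / abs(p[9] - FIXED_POINT)
ratio_aitken = abs(p_hat[10] - FIXED_POINT) / abs(p_hat[9] - FIXED_POINT)
print(ratio_fixed_point, ratio_aitken)
```

Both ratios settle near constants below one, which is the signature of linear
convergence; the order is unchanged and only the constant shrinks.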


Steffensen’s Method 


For linearly convergent fixed point iteration schemes of the form $p_{n+1} = g(p_n)$, it
is possible to accelerate convergence even further by applying a variation of the
Aitken’s $\Delta^2$-method. The basic idea can be explained as follows. Suppose the
starting approximation $p_0$ is given and the values $p_1 = g(p_0)$ and $p_2 = g(p_1)$ are
calculated. Aitken’s $\Delta^2$-method is then applied to compute $\hat{p}$. Since $\hat{p}$ is supposed to
be a better approximation to the fixed point than $p_2$, it seems counterproductive to
continue the iteration using $p_2$. Why not reinitialize the iteration function using $\hat{p}$?
This three-step process is then repeated: From the current approximation, perform
two fixed point iterations and then combine the current approximation and the
two intermediate values according to the Aitken’s $\Delta^2$ formula to form the next
approximation.

The scheme just described is known as Steffensen’s method. The sequence of
calculations is depicted in Figure 2.11. The sequence of approximations is denoted
by $\{\hat{p}_n\}$, and for consistency, the initial approximation is denoted by $\hat{p}_0$. Finally,
$p_{1,n}$ and $p_{2,n}$ denote the intermediate values calculated using the iteration function
starting with $x = \hat{p}_n$. With this notation the Aitken’s $\Delta^2$ formula takes the form

$$\hat{p}_{n+1} = p_{2,n} - \frac{(p_{2,n} - p_{1,n})^2}{p_{2,n} - 2p_{1,n} + \hat{p}_n}.$$

Let’s do an example to determine how much acceleration this technique
produces over the basic strategy of Aitken’s $\Delta^2$-method.
Figure 2.11 Graphical depiction of the sequence of calculations in
Steffensen’s method. The sequence of approximations is denoted by the
$\hat{p}$ values. The values $p_{1,n}$ and $p_{2,n}$ are intermediate values used in the
Aitken’s $\Delta^2$ formula.


EXAMPLE 2.15 Steffensen’s Method in Action 


Let’s reconsider the fixed point iteration problem examined above, this time using
Steffensen’s method to accelerate convergence. With a starting approximation of
$\hat{p}_0 = 0$, we calculate $p_{1,0} = g(\hat{p}_0) = 1$, $p_{2,0} = g(p_{1,0}) = 0.3678794412$, and

$$\hat{p}_1 = 0.3678794412 - \frac{(0.3678794412 - 1)^2}{0.3678794412 + 0 - 2(1)}
  = 0.6126998368.$$

Reinitializing the iteration with $\hat{p}_1$, we obtain $p_{1,1} = g(\hat{p}_1) = 0.541885888$,
$p_{2,1} = g(p_{1,1}) = 0.5816502896$, and

$$\hat{p}_2 = 0.5816502896 - \frac{(0.5816502896 - 0.541885888)^2}{0.5816502896 + 0.6126998368 - 2(0.541885888)}
  = 0.5673508577.$$

At this point, note that $\hat{p}_2$ is more accurate than the tenth term in the sequence
generated by fixed point iteration.

The third iteration of Steffensen’s method produces $\hat{p}_3 = 0.5671432948$. This
value is correct to eight decimal places and has an absolute error of roughly
$4.421 \times 10^{-9}$. Thus, at a cost of only six evaluations of the function $g$, Steffensen’s
method has produced a significantly more accurate approximation to the fixed point than
was obtained with Aitken’s $\Delta^2$-method at a cost of ten function evaluations.

Having obtained such a small absolute error in so few iterations suggests that
Steffensen’s method has done more than just reduce the asymptotic error constant
for a linearly convergent sequence. Examining the final column in the following table
provides evidence that the sequence generated by Steffensen’s method is converging
quadratically.
n    |e_n| = |p̂_n − p|       |e_n|/|e_{n-1}|^2
0    5.6714329 × 10^{-1}
1    4.5556546 × 10^{-2}     0.14163
2    2.0756729 × 10^{-4}     0.10001
3    4.4209310 × 10^{-9}     0.10261


Under fairly mild conditions, it can be shown (see Isaacson and Keller [1]) that
starting from an iteration function that produces linear convergence, Steffensen’s
method will produce quadratic convergence. This is accomplished with two
function evaluations per iteration, the same as Newton’s method, but does not require
knowledge of the derivative.
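
Steffensen’s method as described above can be sketched in Python as follows.
The function name `steffensen`, the stopping rule on successive $\hat{p}$ values,
and the guard against a vanishing denominator are assumptions made here for
illustration; the text itself does not specify a termination test.

```python
import math


def steffensen(g, p_hat, tol=1e-8, max_iter=20):
    """Steffensen's method for a fixed point of g.

    Returns the list [p_hat_0, p_hat_1, ...] of approximations.
    """
    history = [p_hat]
    for _ in range(max_iter):
        p1 = g(p_hat)                   # p_{1,n}
        p2 = g(p1)                      # p_{2,n}
        denom = p2 - 2.0 * p1 + p_hat
        if denom == 0.0:
            break                       # differences vanished; stop here
        p_next = p2 - (p2 - p1) ** 2 / denom
        history.append(p_next)
        if abs(p_next - p_hat) < tol:
            break
        p_hat = p_next
    return history


# Usage: g(x) = exp(-x) with starting approximation 0, as in Example 2.15.
history = steffensen(lambda x: math.exp(-x), 0.0)
print(history)
```

Each pass through the loop costs two evaluations of $g$, matching the count
given in the paragraph above.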


Restoring Quadratic Convergence to Newton’s Method 


This brings us to the problem of restoring quadratic convergence to Newton’s 
method when a root of multiplicity greater than one is being approximated. There 
are two different approaches that can be taken. In the first approach, the func- 
tion to which Newton’s method is applied is modified in such a way that the root 
being approximated is guaranteed to be a simple root. In the second approach, 
the iteration function of Newton’s method is modified so that roots of a particular 
multiplicity can be approximated with a quadratically convergent sequence. Both 
techniques will now be developed, and the merits and shortcomings of each will be 
discussed. 

Let’s start with the approach that modifies the function to which Newton’s
method is applied. Suppose $f$ has a root of multiplicity $m$ at $x = p$, and
consider the function $F$ defined by $F(x) = f(x)/f'(x)$. Since $f$ can be written in
the form $f(x) = (x - p)^m q(x)$, where $\lim_{x \to p} q(x) \ne 0$, it follows that
$f'(x) = (x - p)^{m-1}\left[(x - p)q'(x) + mq(x)\right]$ and

$$F(x) = \frac{(x - p)q(x)}{(x - p)q'(x) + mq(x)} = (x - p)\hat{q}(x),$$

where $\lim_{x \to p} \hat{q}(x) = 1/m \ne 0$. Therefore, $F$ has a simple root at $x = p$, which
implies that Newton’s method applied to $F$ is guaranteed to converge quadratically.

Substituting $F(x) = f(x)/f'(x)$ into the iteration function for Newton’s
method yields

$$g(x) = x - \frac{F(x)}{F'(x)}
       = x - \frac{f(x)/f'(x)}{\left\{[f'(x)]^2 - f(x)f''(x)\right\}/[f'(x)]^2}
       = x - \frac{f(x)f'(x)}{[f'(x)]^2 - f(x)f''(x)}.$$

This last formula points out the two main disadvantages of this approach. First,
both the first and the second derivatives of $f$ are needed. Second, each iteration
requires three function evaluations. On the other hand, this approach will work
regardless of the multiplicity of the root and requires no prior knowledge of that
multiplicity.


EXAMPLE 2.16 Restoring Quadratic Convergence to Newton's 
Method—Approach 1 


Consider the function $f(x) = 1 + \ln x - x$. It is clear that this function has a root
at $x = 1$. The multiplicity of this root happens to be two. The results of applying
Newton’s method to $f(x)$ (standard Newton’s method implementation) and to
$f(x)/f'(x)$ (modified implementation) with a starting approximation of $p_0 = 2$
and a convergence tolerance of $10^{-5}$ are given below.


n     f(x)             f(x)/f'(x)
1     1.3862943611     1.1146099182
2     1.1721921890     1.0036656204
3     1.0815404027     1.0000044517
4     1.0397051441     1.0000000001
5     1.0195949175
6     1.0097340850
7     1.0048513268
8     1.0024217503
9     1.0012098989
10    1.0006047056
11    1.0003022919
12    1.0001511307
13    1.0000755615
14    1.0000377798
15    1.0000188897
16    1.0000094448


The sequence generated by the standard Newton’s method implementation clearly
converges only linearly. Note that the error is cut by one-half with each iteration,
exactly what is to be expected when approximating a root of multiplicity two. The
modified implementation reduced the number of iterations by a factor of four. Even
with the extra work required for each iteration, the modified approach has reduced
the overall amount of work dramatically, from 32 function evaluations down to
only 12.
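
The comparison in this example can be sketched in Python. This is an
illustration only: the helper `iterate` and the two step functions are
hypothetical names, the derivatives of $f(x) = 1 + \ln x - x$ are coded by hand,
and iteration stops when successive approximations differ by less than $10^{-5}$,
which is the stopping rule assumed here.

```python
import math


def f(x):
    return 1.0 + math.log(x) - x


def df(x):
    return 1.0 / x - 1.0


def d2f(x):
    return -1.0 / x**2


def iterate(step, p, tol=1e-5, max_iter=100):
    """Repeat p <- p - step(p) until successive iterates differ by less than tol."""
    for n in range(1, max_iter + 1):
        p_next = p - step(p)
        if abs(p_next - p) < tol:
            return p_next, n
        p = p_next
    raise RuntimeError("iteration did not converge")


def standard_step(x):
    """Newton step for f itself."""
    return f(x) / df(x)


def modified_step(x):
    """Newton step for F = f/f', written in terms of f, f' and f''."""
    return f(x) * df(x) / (df(x) ** 2 - f(x) * d2f(x))


root_std, n_std = iterate(standard_step, 2.0)
root_mod, n_mod = iterate(modified_step, 2.0)
print(root_std, n_std)
print(root_mod, n_mod)
```

Near the double root both $f$ and $f'$ are close to zero, so the modified step is
a ratio of small quantities; a tolerance much tighter than the one used here
would eventually be limited by rounding error.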


The second approach to restoring quadratic convergence to Newton’s method
makes a modification to the method’s iteration function, $g$. Recall that it was shown
in Section 2.4 that Newton’s method converges only linearly to roots of multiplicity
$m > 1$ because $g'(p) = 1 - 1/m$, which is nonzero for $m \ne 1$. The term $1/m$ comes
from the value of the derivative of the term $f(x)/f'(x)$ in the iteration function.
This suggests multiplying the term $f(x)/f'(x)$ by $m$ and replacing the standard
Newton’s method iteration function with

$$g(x) = x - m\,\frac{f(x)}{f'(x)}.$$

The derivative of this iteration function, evaluated at $x = p$, would then be equal
to $1 - (1/m) \times m = 0$, implying quadratic convergence of the generated sequence.
This approach to restoring quadratic convergence to Newton’s method has
the advantage over the previous approach of requiring no new function evaluations.
Unfortunately, this approach does require a priori knowledge of the multiplicity of
the root. When the multiplicity is known, however, it can be supplied as an input
parameter to the rootfinding routine. In Exercise 13 we will explore a procedure
for estimating the multiplicity of a root.

EXAMPLE 2.17 Restoring Quadratic Convergence to Newton’s 
Method—Approach 2 


Reconsider the function $f(x) = 1 + \ln x - x$, which has a root of multiplicity two at
$x = 1$. With a starting approximation of $p_0 = 2$, a convergence tolerance of $10^{-5}$,
and using $m = 2$ in the iteration function $g$, we obtain the results

n   p_n
1   0.7725887222
2   0.9804852866
3   0.9998718053
4   0.9999999945
5   0.9999999945


Even with one additional iteration needed to achieve convergence, fewer function 
evaluations (10 versus 12) were performed with this approach than Approach 1, 
above—for this example, at least. 
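The iteration of Approach 2 can be sketched in a few lines of Python. This is only an illustrative sketch, not the text's own routine: the function name `modified_newton`, the stopping rule (successive iterates differing by less than `tol`), and the guard against a zero derivative are assumptions made here.

```python
import math

def modified_newton(f, df, p0, m, tol=1e-5, max_iter=50):
    """Iterate p <- p - m*f(p)/f'(p) until successive iterates differ by < tol.

    m is the (assumed known) multiplicity of the root being sought.
    Returns the final iterate and the list of all iterates.
    """
    p = p0
    history = [p0]
    for _ in range(max_iter):
        slope = df(p)
        if slope == 0.0:
            # f'(p) vanishes at a multiple root; stop rather than divide by zero
            break
        p_next = p - m * f(p) / slope
        history.append(p_next)
        if abs(p_next - p) < tol:
            break
        p = p_next
    return history[-1], history

# The function from Example 2.17: double root at x = 1
f = lambda x: 1.0 + math.log(x) - x
df = lambda x: 1.0 / x - 1.0

root, history = modified_newton(f, df, p0=2.0, m=2)
for n, p in enumerate(history):
    print(n, f"{p:.10f}")
```

Near a multiple root both $f$ and $f'$ are tiny, so the quotient $f/f'$ is computed with reduced relative accuracy; the last digits of the final iterates may therefore differ slightly from one machine to another.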


References

EXERCISES

1. Show that the equation for Aitken’s $\Delta^2$-method can be rewritten as

$$\hat{p}_n = \frac{p_n p_{n-2} - p_{n-1}^2}{p_n - 2p_{n-1} + p_{n-2}}.$$

Explain why this formula is inferior to the one used in the text.

2. Should Aitken’s $\Delta^2$-method be applied to a sequence generated by the bisection
method? Explain.


3. The sequence listed below was obtained from the method of false position applied
to the function $f(x) = \tan(\pi x) - x - 6$ over the interval (0.40, 0.48).

n   p_n
1   0.420867411
2   0.433202750
3   0.440495739
4   0.444807925
5   0.447357748
6   0.448865516
7   0.449757107

(a) Apply Aitken’s $\Delta^2$-method to the given sequence.

(b) To nine digits, the zero of $f$ on (0.40, 0.48) is $x = 0.451047259$. Use this
to show that both the original sequence and the output from Aitken’s $\Delta^2$-method
are linearly convergent and estimate the corresponding asymptotic
error constants. By how much has Aitken’s $\Delta^2$-method reduced the asymptotic
error constant?


4. The sequence listed below was obtained from Newton’s method applied to the
function $f(x) = x(1 - \cos x)$ to approximate the zero at $x = 0$.

n   p_n
1   0.646703997
2   0.4259712114
3   0.282530441
4   0.187933565
5   0.125165810
6   0.083407519
7   0.055594262

(a) Apply Aitken’s $\Delta^2$-method to the given sequence.

(b) Verify that both the original sequence and the output from Aitken’s $\Delta^2$-method
are linearly convergent and estimate the corresponding asymptotic
error constants. By how much has Aitken’s $\Delta^2$-method reduced the asymptotic
error constant?


5. The sequence listed below was obtained from fixed point iteration applied to the
function $g(x) = \sqrt{10/(2 + x)}$, which has a unique fixed point.

n   p_n
1   2.236067977
2   1.536450382
3   1.681574897
4   1.648098560
5   1.655643081
6   1.653933739
7   1.654320556

(a) Apply Aitken’s $\Delta^2$-method to the given sequence.
(b) To ten digits, the fixed point of $g$ is $x = 1.654249158$. Use this to show
that both the original sequence and the output from Aitken’s $\Delta^2$-method
are linearly convergent and estimate the corresponding asymptotic error
constants. By how much has Aitken’s $\Delta^2$-method reduced the asymptotic
error constant?

6. Apply Steffensen’s method to the iteration function $g(x) = \frac{1}{2}\sqrt{10 - x^3}$ using a
starting value of $p_0 = 1$. Perform four iterations, compute the absolute error in
each approximation and confirm quadratic convergence. To twenty digits, the
fixed point of $g$ nearest $x = 1$ is $x = 1.3652300134140968438$.


7. (a) Perform ten iterations to approximate the fixed point of $g(x) = \cos x$ using
$p_0 = 0$. Verify that the sequence converges linearly and estimate the
asymptotic error constant. To 20 digits, the fixed point is
$x = 0.73908513321516064166$.

(b) Accelerate the convergence of the sequence obtained in part (a) using Aitken’s
$\Delta^2$-method. By how much has Aitken’s $\Delta^2$-method reduced the
asymptotic error constant?

(c) Apply Steffensen’s method to $g(x) = \cos x$ using the same starting approximation
specified in part (a). Perform four iterations, and verify that
convergence is quadratic.


8. (a) Perform ten iterations to approximate the fixed point of $g(x) = \ln(4 + x - x^2)$
using $p_0 = 2$. Verify that the sequence converges linearly and
estimate the asymptotic error constant. To 20 digits, the fixed point is
$x = 1.2886779668238684115$.

(b) Accelerate the convergence of the sequence obtained in part (a) using Aitken’s
$\Delta^2$-method. By how much has Aitken’s $\Delta^2$-method reduced the
asymptotic error constant?

(c) Apply Steffensen’s method to $g(x) = \ln(4 + x - x^2)$ using the same starting
approximation specified in part (a). Perform four iterations, and verify that
convergence is quadratic.


9. (a) Perform ten iterations to approximate the fixed point of $g(x) = (1.05 + \ln x)/1.04$
using $p_0 = 1$. Verify that the sequence converges linearly and
estimate the asymptotic error constant. To 20 digits, the fixed point is
$x = 1.1097123038867133005$.

(b) Accelerate the convergence of the sequence obtained in part (a) using Aitken’s
$\Delta^2$-method. By how much has Aitken’s $\Delta^2$-method reduced the
asymptotic error constant?

(c) Apply Steffensen’s method to $g(x) = (1.05 + \ln x)/1.04$ using the same
starting approximation specified in part (a). Perform five iterations, and
verify that convergence is quadratic.

10. The function $f(x) = 27x^4 + 162x^3 - 180x^2 + 62x - 7$ has a zero of multiplicity 3 at
$x = 1/3$. Apply both techniques for restoring quadratic convergence to Newton’s
method to this problem. Use $p_0 = 0$, and verify that both resulting sequences
converge quadratically.

11. The function $f(x) = \dfrac{x}{1 + x^2} - \dfrac{500}{841}\left(1 - \dfrac{21x}{125}\right)$ has a zero of multiplicity 2 at $x = 2.5$.
Apply both techniques for restoring quadratic convergence to Newton’s method
to this problem. Use $p_0 = 2$, and verify that both resulting sequences converge
quadratically.

12. The function $f(x) = x(1 - \cos x)$ has a zero of multiplicity 3 at $x = 0$. Apply
both techniques for restoring quadratic convergence to Newton’s method to this
problem, using $p_0 = 1$. You should observe that the resulting sequences appear
to converge faster than quadratically. What apparent order of convergence do
you observe? Why is convergence faster than quadratic for this problem?


13. Suppose Newton’s method is applied to a function with a zero of multiplicity
$m > 1$. Show that the multiplicity of the zero can be estimated as the integer
nearest to

$$m \approx \frac{p_{n-1} - p_{n-2}}{2p_{n-1} - p_n - p_{n-2}}.$$

Verify that this formula produces an accurate estimate when applied to the
sequence listed in Exercise 4 and when applied to the sequence generated when
Newton’s method was applied to the function $f(x) = 1 + \ln x - x$ in the text.
14. Each of the following functions has a zero of multiplicity greater than one at
the specified location. In each case, apply the secant method to the function
$f(x)/f'(x)$ to approximate the indicated zero. Has the order of convergence
been restored to $\alpha \approx 1.618$?

(a) $f(x) = 1 + \ln x - x$ has a zero at $x = 1$; use $p_0 = -1$ and $p_1 = 2$.

(b) $f(x) = 27x^4 + 162x^3 - 180x^2 + 62x - 7$ has a zero at $x = 1/3$.

(c) $f(x) = \dfrac{x}{1 + x^2} - \dfrac{500}{841}\left(1 - \dfrac{21x}{125}\right)$ has a zero at $x = 2.5$.

15. Repeat Exercise 14, but this time replace the standard secant method formula
for $p_{n+1}$ by the formula

$$p_{n+1} = p_n - m f(p_n)\,\frac{p_n - p_{n-1}}{f(p_n) - f(p_{n-1})},$$

where $m$ is the multiplicity of the zero being approximated. The functions in (a)
and (c) have $m = 2$, and the function in (b) has $m = 3$.
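16. The method of false position and fixed point iteration generate linearly convergent
sequences for which

$$\lim_{n \to \infty} \frac{p_n - p}{p_{n-1} - p} \qquad (6)$$

exists. Note that this limit does not involve absolute values. Let $\lambda$ denote
the value of this limit. This exercise will lead us through the proof that the
sequence produced by Aitken’s $\Delta^2$-method converges more rapidly than linearly
convergent sequences for which (6) exists.

(a) Let

$$\varepsilon_n = \frac{p_n - p}{p_{n-1} - p} - \lambda.$$

Show that $\varepsilon_n \to 0$ as $n \to \infty$.

(b) Show that

$$p_n - p_{n-1} = (p_n - p)\left(1 - \frac{1}{\varepsilon_n + \lambda}\right).$$

(c) Show that

$$p_n - 2p_{n-1} + p_{n-2} = (p_n - p)\left[1 - \frac{2}{\varepsilon_n + \lambda} + \frac{1}{(\varepsilon_n + \lambda)(\varepsilon_{n-1} + \lambda)}\right]
= \frac{p_n - p}{(\varepsilon_n + \lambda)(\varepsilon_{n-1} + \lambda)}\left[(\lambda - 1)^2 + \varepsilon_n'\right],$$

where $\varepsilon_n' = \varepsilon_n\varepsilon_{n-1} + \lambda(\varepsilon_n + \varepsilon_{n-1}) - 2\varepsilon_{n-1}$. Further, show that $\varepsilon_n' \to 0$ as
$n \to \infty$.

(d) Show that

$$\frac{\hat{p}_n - p}{p_n - p} = 1 - \frac{\varepsilon_{n-1} + \lambda}{\varepsilon_n + \lambda} \cdot \frac{(\varepsilon_n + \lambda - 1)^2}{(\lambda - 1)^2 + \varepsilon_n'};$$

hence, as $n \to \infty$,

$$\frac{\hat{p}_n - p}{p_n - p} \to 0.$$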

2.7 ROOTS OF POLYNOMIALS


The previous sections have been devoted to the development of techniques for solv- 
ing the rootfinding problem with an arbitrary nonlinear function. We conclude 
the chapter by examining the special problem of locating the roots of polynomial 
functions. Such polynomial rootfinding problems arise in a variety of situations, in- 
cluding the solution of constant coefficient differential equations and the derivation 
of numerical integration techniques known as Gaussian quadrature rules (a topic 
which will be covered in Chapter 6). 


Working with Polynomials 


Anyone who has worked with MATLAB, Maple, or Mathematica has no doubt recog- 
nized that these systems treat polynomials differently from other functions when it 
comes to rootfinding. For instance, MATLAB provides two built-in m-files: roots, 
which is designed for polynomials and computes all roots, and fzero, which can 
be used with any type of function but determines one root only. The Maple and 
Mathematica rootfinding commands can take any type of function as input but 
internally distinguish between polynomials, for which all roots are computed, and 
non-polynomials, for which only one root is determined. 

Why are polynomials treated differently from other functions? The answer 
to this question comes in two parts. First, although there is no general theory for 
the number of roots that an arbitrary nonlinear equation has, as a consequence of 
the Fundamental Theorem of Algebra, every nth-degree polynomial has precisely n 
roots in the complex plane, counting multiplicities. When working with polynomials
with real coefficients (as we will do exclusively in this section), there is the additional
fact that complex roots can occur only in conjugate pairs. Therefore, if $3 + i$ is
found to be a root of a given polynomial with real coefficients, $3 - i$ must also be
a root.

The second reason that polynomials are treated differently with regard to
rootfinding is the following. Let $p$ be an nth-degree polynomial, and suppose that
$x = x^*$ is a root of $p$. Then the linear factor $x - x^*$ can be factored from $p$, leaving
$p(x) = (x - x^*)q(x)$, where $q$ is a polynomial of degree $n - 1$. Hence, polynomial
rootfinding possesses a natural reduction of order. This process of removing a
previously determined root and reducing the size of the remaining problem is known
as deflation. We will encounter deflation again in Chapter 4 when working with
the algebraic eigenvalue problem.


Polynomial Deflation


Suppose the nth-degree polynomial $p$ is given by

$$p(x) = a_n x^n + a_{n-1} x^{n-1} + a_{n-2} x^{n-2} + \cdots + a_1 x + a_0.$$

Further, suppose that $x = x^*$ is a root of $p$. Then, as noted above, $p$ can be written
in the form $p(x) = (x - x^*)q(x)$, where

$$q(x) = b_{n-1} x^{n-1} + b_{n-2} x^{n-2} + b_{n-3} x^{n-3} + \cdots + b_1 x + b_0$$

is a polynomial of degree $n - 1$. To determine the relationship between the coefficients
of the deflated polynomial $q$ and the original polynomial $p$, expand the
product $(x - x^*)q(x)$ to obtain

$$b_{n-1} x^n + (b_{n-2} - b_{n-1}x^*) x^{n-1} + (b_{n-3} - b_{n-2}x^*) x^{n-2} + \cdots + (b_0 - b_1 x^*) x - b_0 x^*.$$

Equating coefficients on like powers of $x$ between this last expression and the
polynomial $p$ yields

$$b_{n-1} = a_n \quad\text{and} \qquad (1)$$

$$b_k = a_{k+1} + b_{k+1} x^*, \quad k = n-2, n-3, n-4, \ldots, 0. \qquad (2)$$

The algorithm given by equations (1) and (2) is commonly known as synthetic
division.
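Equations (1) and (2) translate almost directly into code. The sketch below is illustrative only: the function name `synthetic_division`, the highest-degree-first coefficient ordering, and the sample cubic are choices made here, not part of the text. It also returns the quantity $a_0 + b_0 x^*$, which is zero (up to roundoff) when $x^*$ is a root.

```python
def synthetic_division(a, x_star):
    """Deflate p by the factor (x - x_star).

    a holds the coefficients of p from highest degree to lowest,
    [a_n, a_{n-1}, ..., a_1, a_0], with n >= 1.
    Returns (b, remainder) where b = [b_{n-1}, ..., b_0] are the
    coefficients of q and remainder = a_0 + b_0 * x_star.
    """
    b = [a[0]]                            # equation (1): b_{n-1} = a_n
    for coeff in a[1:-1]:                 # equation (2): b_k = a_{k+1} + b_{k+1} x*
        b.append(coeff + b[-1] * x_star)
    remainder = a[-1] + b[-1] * x_star    # vanishes when x_star is a root of p
    return b, remainder

# Hypothetical illustration: x^3 - 6x^2 + 11x - 6 has the root x* = 1
q, rem = synthetic_division([1, -6, 11, -6], 1)
print(q, rem)
```

Storing the coefficients highest degree first means the loop visits them in exactly the order $k = n-2, n-3, \ldots, 0$ required by equation (2).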


EXAMPLE 2.18 Polynomial Deflation in Action 


To demonstrate both the synthetic division algorithm and the role of deflation in
the polynomial rootfinding process, consider the fourth-degree polynomial

$$p(x) = x^4 + 2x^3 + 4x^2 - 2x - 5.$$

Upon inspection, we note that the sum of the coefficients of $p$ is zero, meaning that
$p(1) = 0$. Thus, $x^* = 1$ is a root. Applying synthetic division to $p$ with $x^* = 1$
yields

$$\begin{aligned}
b_3 &= a_4 = 1; \\
b_2 &= a_3 + b_3 x^* = 2 + 1 \cdot 1 = 3; \\
b_1 &= a_2 + b_2 x^* = 4 + 3 \cdot 1 = 7; \text{ and} \\
b_0 &= a_1 + b_1 x^* = -2 + 7 \cdot 1 = 5.
\end{aligned}$$

Therefore, $p(x) = (x - 1)q(x)$, where $q(x) = x^3 + 3x^2 + 7x + 5$.

By trial and error, we find that $q(-1) = 0$, so $x^* = -1$ is a root of $q$, and
hence also of $p$. Applying synthetic division to $q$ with $x^* = -1$ yields

$$\begin{aligned}
b_2 &= a_3 = 1; \\
b_1 &= a_2 + b_2 x^* = 3 + 1(-1) = 2; \text{ and} \\
b_0 &= a_1 + b_1 x^* = 7 + 2(-1) = 5.
\end{aligned}$$

Therefore, $q(x) = (x + 1)r(x)$, where $r(x) = x^2 + 2x + 5$.

Finally, because $r$ is a quadratic polynomial, we can use the quadratic formula.
This produces the complex conjugate roots $-1 \pm 2i$. Bringing all of this information
together, we see that the polynomial $p(x) = x^4 + 2x^3 + 4x^2 - 2x - 5$ has a pair of
distinct real roots, $\pm 1$, and a complex conjugate pair of roots, $-1 \pm 2i$.


Unlike the previous example, in which exact roots were determined at each 
stage, the roots in practical problems will only be calculated to finite precision. The 
use of an approximate root in the synthetic division procedure will then introduce 
inaccuracies into the coefficients of the deflated polynomial. For many polynomials, 
this will not present a serious problem; however, there are polynomials that are 
extremely ill conditioned. The roots of these polynomials can be very sensitive to 
changes in the coefficients. Consequently, deflation based on an approximate root 
can lead to degraded accuracy in subsequently determined roots. 

To reduce the effect of deflation-induced errors, we can treat the roots obtained
from deflated polynomials as merely initial estimates for the roots of the
original polynomial. These initial estimates can then be refined, or polished, by 
applying a rootfinding technique to the original polynomial. This refinement pro- 
cess is not without its pitfalls, though. It is possible for two roots obtained from 
different deflated polynomials to converge to the same root of the original poly- 
nomial, thereby producing a spurious multiplicity. For a more detailed analysis 
and discussion of the deflation and refinement processes, consult Wilkinson [1] and 
Peters and Wilkinson [2]. 


Laguerre's Method 


Any of the basic rootfinding techniques discussed earlier in this chapter could be 
used at the heart of a polynomial rootfinding algorithm. We will, however, base 
our algorithm on a technique known as Laguerre’s method, which is specifically 
designed for use with polynomials. This technique requires only one starting value.
When all of the roots are real, it is guaranteed to converge to a root from any
starting value; when complex roots are present, convergence is not guaranteed, but
failures are rare in practice. For simple roots, convergence is cubic. Laguerre’s
method can also produce an approximation to a complex root from a real starting value.

So how does Laguerre’s method work? The basic process is iterative in nature.
Let $\tilde{x}$ denote the current approximation to a root of the polynomial

$$p(x) = c(x - x_1)(x - x_2) \cdots (x - x_n).$$

Here, $c$ is some constant. The starting approximation is often taken to be zero so
as to favor convergence toward the root of smallest magnitude. Next, consider the
functions

$$G(x) = \frac{d \ln |p(x)|}{dx} = \frac{p'(x)}{p(x)} = \frac{1}{x - x_1} + \frac{1}{x - x_2} + \cdots + \frac{1}{x - x_n}$$

and

$$H(x) = -\frac{d^2 \ln |p(x)|}{dx^2} = \left[\frac{p'(x)}{p(x)}\right]^2 - \frac{p''(x)}{p(x)} = \frac{1}{(x - x_1)^2} + \frac{1}{(x - x_2)^2} + \cdots + \frac{1}{(x - x_n)^2}.$$

Now, make the following set of assumptions. First, assume that the current approximation
is some distance $a$ from the root $x_i$. Second, assume that all other
roots are a distance $b$ from the approximation. In other words, assume

$$\tilde{x} - x_i = a \quad\text{but}\quad \tilde{x} - x_j = b \text{ for all } j \ne i.$$

Evaluating $G$ and $H$ at $\tilde{x}$ and taking into account the above expressions then yields

$$G = \frac{1}{a} + \frac{n-1}{b} \quad\text{and}\quad H = \frac{1}{a^2} + \frac{n-1}{b^2}.$$

The solution of these equations for $a$ is

$$a = \frac{n}{G \pm \sqrt{(n-1)(nH - G^2)}},$$

where the sign in front of the radical should be chosen to make the magnitude of
the denominator as large as possible. Finally, replace $\tilde{x}$ by $\tilde{x} - a$ and repeat. The
iteration is terminated when the magnitude of $a$ falls below a specified convergence
tolerance.
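As a rough illustration of a single Laguerre update, the following sketch computes the correction $a$ from the values of $p$, $p'$ and $p''$ at the current approximation. The function name `laguerre_step` and the sample cubic are assumptions made for this illustration; complex arithmetic (`cmath`) is used so that a negative quantity under the radical is handled without special cases, and the caller is assumed to have checked that $p \ne 0$.

```python
import cmath

def laguerre_step(n, p, dp, d2p):
    """Return the Laguerre correction a for a degree-n polynomial.

    p, dp, d2p are the values of the polynomial and its first two
    derivatives at the current approximation (p must be nonzero).
    The new approximation is (current approximation) - a.
    """
    G = dp / p
    H = G * G - d2p / p
    radical = cmath.sqrt((n - 1) * (n * H - G * G))
    # choose the sign that makes the denominator largest in magnitude
    if abs(G + radical) >= abs(G - radical):
        denom = G + radical
    else:
        denom = G - radical
    return n / denom

# Hypothetical illustration: p(x) = x^3 - 6x^2 + 11x - 6 at x = 0,
# where p = -6, p' = 11 and p'' = -12; the nearest root is x = 1.
x = 0.0
a = laguerre_step(3, -6.0, 11.0, -12.0)
x_new = x - a
print(x_new)
```

A single step from $x = 0$ already lands close to the root at $x = 1$, which is consistent with the rapid convergence described above.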

Each iteration of Laguerre’s method requires the evaluation of p, p’ and p”. 
Fortunately, these three function values can be calculated simultaneously, without 
ever having to explicitly compute either of the derivative functions. The key observation
is that with one extra calculation, synthetic division will produce the value
of $p$ at $x = x^*$. In particular, consider the quantity $a_0 + b_0 x^*$. Working backward
through the coefficients computed by the synthetic division algorithm, it follows
that

$$\begin{aligned}
a_0 + b_0 x^* &= a_0 + (a_1 + b_1 x^*) x^* = a_0 + a_1 x^* + b_1 (x^*)^2 \\
&= a_0 + a_1 x^* + (a_2 + b_2 x^*)(x^*)^2 = a_0 + a_1 x^* + a_2 (x^*)^2 + b_2 (x^*)^3 \\
&\;\;\vdots \\
&= a_0 + a_1 x^* + a_2 (x^*)^2 + \cdots + a_{n-1} (x^*)^{n-1} + b_{n-1} (x^*)^n \\
&= a_0 + a_1 x^* + a_2 (x^*)^2 + \cdots + a_{n-1} (x^*)^{n-1} + a_n (x^*)^n = p(x^*).
\end{aligned}$$

Now let’s focus on the evaluation of the derivatives of $p$. While computing
$p(x^*)$, synthetic division produces the coefficients of a polynomial $q(x)$ such
that

$$p(x) = (x - x^*)q(x) + p(x^*).$$

Taking the first derivative of this equation and evaluating at $x = x^*$ gives $p'(x^*) = q(x^*)$.
Hence, the value of $p'(x^*)$ can be obtained by applying synthetic division
to the coefficients of $q$ as they are computed during the evaluation of $p$. Of course,
this process will not only determine the value of $q(x^*)$, it will also determine the
coefficients of a polynomial $r(x)$ for which

$$q(x) = (x - x^*)r(x) + q(x^*),$$

or, equivalently, for which

$$p(x) = (x - x^*)^2 r(x) + (x - x^*)q(x^*) + p(x^*).$$

Taking two derivatives of this latter equation and substituting $x = x^*$ shows that
$p''(x^*) = 2r(x^*)$. Therefore, the second derivative of $p$ can be evaluated by applying
synthetic division to the coefficients of $r$. Bringing all of this information together
suggests the following algorithm for simultaneously computing $p(x^*)$, $p'(x^*)$
and $p''(x^*)$:


p := a_n, q := r := 0
for i from n - 1 downto 0 do
    r := r · x* + q
    q := q · x* + p
    p := p · x* + a_i
p' := q, p'' := 2r
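The loop above carries over to Python with almost no change. In this sketch the function name `eval_with_derivatives` and the highest-degree-first coefficient ordering are assumptions made for illustration; the sample call uses the polynomial of Example 2.19 at $x = 0$.

```python
def eval_with_derivatives(a, x):
    """Return (p(x), p'(x), p''(x)) for p with coefficients
    a = [a_n, a_{n-1}, ..., a_1, a_0] (highest degree first).

    Each pass updates r, then q, then p, so that every update
    uses the values of q and p from the previous pass.
    """
    p, q, r = a[0], 0.0, 0.0
    for coeff in a[1:]:
        r = r * x + q
        q = q * x + p
        p = p * x + coeff
    return p, q, 2.0 * r

# p(x) = 16x^4 + 70x^3 - 169x^2 - 580x + 75 evaluated at x = 0
values = eval_with_derivatives([16, 70, -169, -580, 75], 0.0)
print(values)
```

The order of the three updates inside the loop matters: swapping them would mix values from different passes and give incorrect derivatives.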


EXAMPLE 2.19 Laguerre’s Method in Action 


To demonstrate Laguerre’s method, consider the polynomial

$$p(x) = 16x^4 + 70x^3 - 169x^2 - 580x + 75.$$

Let’s take $x_0 = 0$ as an initial approximation and $\epsilon = 5 \times 10^{-11}$ as a convergence
tolerance. Evaluating $p$, $p'$ and $p''$ at $x = x_0$ yields

$$p(x_0) = 75, \quad p'(x_0) = -580 \quad\text{and}\quad p''(x_0) = -338.$$

With these values we now calculate

$$G = \frac{p'(x_0)}{p(x_0)} = -7.7333333333; \text{ and}$$

$$H = \left[\frac{p'(x_0)}{p(x_0)}\right]^2 - \frac{p''(x_0)}{p(x_0)} = 64.3111111111.$$

Since $G$ is negative, we choose the negative sign in front of the radical in the
denominator of the formula for $a$. With $n = 4$, we find

$$a = \frac{4}{G - \sqrt{3(4H - G^2)}} = -0.12472343153714;$$

therefore,

$$x_1 = x_0 - a = 0.12472343153714.$$

Because $|a| > \epsilon$, we perform another iteration. Evaluating $p$, $p'$, and $p''$ at
$x = x_1$ yields

$$p(x_1) = 0.1711418588, \quad p'(x_1) = -618.1876560151, \quad\text{and}\quad p''(x_1) = -282.6294193545.$$

These values then lead to

$$G = -3.6155129190 \times 10^{3}, \quad H = 1.3073585101 \times 10^{7}$$

and

$$a = -2.7656845983 \times 10^{-4}.$$

Thus,

$$x_2 = x_1 - a = 0.12499999999697.$$

Once again, $|a| > \epsilon$, so we perform a third iteration. Evaluating $p$, $p'$, and $p''$
at $x = x_2$ yields

$$p(x_2) = 1.8773818056 \times 10^{-9}, \quad p'(x_2) = -618.8437499991, \quad\text{and}\quad p''(x_2) = -282.5000000014.$$

From here, we calculate

$$G = -3.2963127061 \times 10^{11}, \quad H = 1.0865677457 \times 10^{23}$$

and

$$a = -3.0336927626 \times 10^{-12}.$$

Therefore,

$$x_3 = x_2 - a = 0.12500000000000.$$


Since $|a|$ is now less than $\epsilon$, we terminate the iteration and accept $x_3$ as an approximation
to one of the roots of $p$. The value of $x_3$ is correct to all digits shown.

Laguerre’s method will locate one root of a polynomial. To approximate all
of the roots of a polynomial, we proceed as follows. After a root is determined,
that root is used to deflate the polynomial. If a complex root is found, then the
polynomial is deflated by both that root and its complex conjugate. This process
(locate a root and deflate) is repeated until either one or two roots remain to
be found. If one root remains, then the original polynomial has been reduced to
a linear function, whose root may be found directly; if two roots remain, then
the original polynomial has been reduced to a quadratic function, to which the
quadratic formula can be applied.
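The locate-and-deflate procedure can be sketched as follows. This is a minimal illustration under stated assumptions rather than a production routine: the function names, the `imag_tol` threshold used to decide whether a computed root counts as real, and the choice to deflate a complex pair by two complex synthetic divisions (discarding the residual imaginary parts) are all choices made here. No polishing against the original polynomial is attempted.

```python
import cmath

def evaluate(a, x):
    """Return p(x), p'(x), p''(x); a = [a_n, ..., a_0]."""
    p, q, r = a[0], 0.0, 0.0
    for c in a[1:]:
        r = r * x + q
        q = q * x + p
        p = p * x + c
    return p, q, 2.0 * r

def laguerre(a, x0=0.0, tol=5e-11, max_iter=100):
    """Laguerre iteration from x0; stops when |a| < tol."""
    n = len(a) - 1
    x = complex(x0)
    for _ in range(max_iter):
        p, dp, d2p = evaluate(a, x)
        if p == 0:
            return x                      # landed exactly on a root
        G = dp / p
        H = G * G - d2p / p
        s = cmath.sqrt((n - 1) * (n * H - G * G))
        denom = G + s if abs(G + s) >= abs(G - s) else G - s
        step = n / denom
        x -= step
        if abs(step) < tol:
            return x
    raise RuntimeError("Laguerre iteration did not converge")

def deflate(a, root):
    """Synthetic division of p by (x - root); the remainder is dropped."""
    b = [a[0]]
    for c in a[1:-1]:
        b.append(c + b[-1] * root)
    return b

def all_roots(a, imag_tol=1e-8):
    """Approximate every root of a real-coefficient polynomial."""
    a, roots = list(a), []
    while len(a) - 1 > 2:
        z = laguerre(a)
        if abs(z.imag) < imag_tol:        # treat as a real root
            roots.append(z.real)
            a = deflate(a, z.real)
        else:                             # remove z and its conjugate
            roots.extend([z, z.conjugate()])
            a = [c.real for c in deflate(deflate(a, z), z.conjugate())]
    if len(a) == 3:                       # quadratic formula
        d = cmath.sqrt(a[1] ** 2 - 4 * a[0] * a[2])
        roots.extend([(-a[1] + d) / (2 * a[0]), (-a[1] - d) / (2 * a[0])])
    elif len(a) == 2:                     # linear function
        roots.append(-a[1] / a[0])
    return roots

roots = all_roots([16, 70, -169, -580, 75])
print(roots)
```

For polynomials with nearly multiple roots, the threshold `imag_tol` decides whether a tightly clustered complex pair is reported as complex or collapsed to a real root, so its value should be chosen with the problem in mind.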


EXAMPLE 2.20 Finding All of the Roots 


Above, we found that $x_1^* = 0.12500000000000$ was a root of the fourth-degree
polynomial

$$p(x) = 16x^4 + 70x^3 - 169x^2 - 580x + 75.$$

Using this root to deflate $p$ yields the third-degree polynomial

$$q(x) = 16x^3 + 72x^2 - 160x - 600.$$

If we now apply Laguerre’s method to $q$ with the same starting approximation
and convergence tolerance as used earlier, five iterations produce the root
$x_2^* = -2.50000000000000$. Deflating $q$ with this root leaves the quadratic

$$r(x) = 16x^2 + 32x - 240.$$

From the quadratic formula we obtain the final two roots: $x_3^* = 3.00000000000000$
and $x_4^* = -5.00000000000000$. All roots are correct to the digits shown.


EXAMPLE 2.21 Two More Polynomials 
Consider the fifth-degree polynomial

$$p(x) = x^5 - 3.4x^4 + 5.4531x^3 - 4.20772x^2 + 1.50924x - 0.20304.$$

Using a convergence tolerance of $5 \times 10^{-11}$ and a starting approximation of 0 for
each application of Laguerre’s method, the roots of this polynomial were found to
be

$$\begin{aligned}
x_1^* &= 0.44999999975510 \\
x_2^* &= 0.47000000070755 \\
x_3^* &= 0.47999999953778 \\
x_4^* &= 0.99999999999978 + 1.00000000000002i \\
x_5^* &= 0.99999999999978 - 1.00000000000002i
\end{aligned}$$

The first two roots were obtained after six iterations each, and the third root was
obtained after four iterations. The complex conjugate pair was obtained from the
quadratic formula. The exact roots of this polynomial are 0.45, 0.47, 0.48, and
$1 \pm i$.

Finally, consider the polynomial

$$p(x) = 4x^4 - 9x^3 + 3x^2 + 5x - 3,$$

which has a simple root at $x = -0.75$ and a root of multiplicity 3 at $x = 1$. With the
same convergence tolerance and starting approximation as used above, the following
estimates for the roots were obtained:

$$\begin{aligned}
x_1^* &= 0.99988146132218 \\
x_2^* &= 1.00005926933960 + 0.00010265287051i \\
x_3^* &= 1.00005926933960 - 0.00010265287051i \\
x_4^* &= -0.75000000000139.
\end{aligned}$$

Ten iterations were needed to achieve convergence for the first root and eleven for
the complex conjugate pair. As might have been expected, the algorithm has no
problem estimating the simple root at $x = -0.75$, but has difficulty estimating the
triple root at $x = 1$. In fact, the algorithm does not find three real roots, but rather
one real root and a complex conjugate pair tightly clustered around $x = 1$.


An Application Problem: Chemical Equilibrium 


One mole of nitrogen gas and one mole of hydrogen gas are injected into a one liter 
reaction chamber. The temperature within the chamber is maintained at 1000 K, 
and the reversible reaction 

$$\mathrm{N_2 + 3H_2 \rightleftharpoons 2NH_3}$$

is allowed to proceed to equilibrium. If the equilibrium constant for the indicated
reaction is $k = 2.37 \times 10^{-3}$ at 1000 K, how much ammonia (NH3) is present at
equilibrium?

The equilibrium constant for a reversible reaction is given by the product of
the concentrations of the substances that appear on the right side of the reaction
equation divided by the concentrations of the substances that appear on the left
side. Each of these concentrations is raised to the power of the coefficient of that
substance in the reaction equation. For this reaction, then,

$$k = 2.37 \times 10^{-3} = \frac{[\mathrm{NH_3}]^2}{[\mathrm{N_2}][\mathrm{H_2}]^3},$$

where $[\,\cdot\,]$ denotes the concentration of the indicated substance.

Let’s assume that at equilibrium there are $x$ moles/liter of NH3; that is,
$[\mathrm{NH_3}] = x$. Since the reaction chamber has a volume of one liter, this means that
$x$ moles of NH3 are present. From the reaction equation, it is seen that for every
two moles of NH3 produced, one mole of N2 is used. Thus, to produce $x$ moles of
NH3, $x/2$ moles of N2 must have reacted; hence, at equilibrium, $1 - \frac{x}{2}$ moles of N2
remain. In other words, $[\mathrm{N_2}] = 1 - \frac{x}{2}$ at equilibrium. By a similar argument, it
follows that $[\mathrm{H_2}] = 1 - \frac{3x}{2}$ at equilibrium. Substituting these concentrations into
the equilibrium constant expression yields

$$2.37 \times 10^{-3} = \frac{x^2}{\left(1 - \frac{x}{2}\right)\left(1 - \frac{3x}{2}\right)^3}.$$

This last equation can be rearranged into the form

$$3.999375x^4 - 15.9975x^3 - 978.67x^2 - 11.85x + 2.37 = 0.$$

The roots of this equation are $x = 0.04351$, $x = -0.05566$, $x = -13.76348$ and
$x = 17.77563$. The two negative roots must be discarded since the concentration of
ammonia cannot be negative. The root $x = 17.77563$ must also be discarded since
this would correspond to a negative concentration of both nitrogen and hydrogen.
Therefore, at equilibrium, there are 0.04351 moles/liter of NH3 present.
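The rearrangement and the physically meaningful root can be checked numerically. The sketch below is an illustration only: it uses plain bisection on the interval $(0, 2/3)$, on which all three concentrations are nonnegative, rather than the Laguerre-based procedure of this section, and the names `K_EQ`, `residual` and `bisect` are introduced here.

```python
K_EQ = 2.37e-3   # equilibrium constant at 1000 K

# Coefficients of 1000 * [k (1 - x/2)(1 - 3x/2)^3 - x^2], highest degree first.
# (1 - x/2)(1 - 3x/2)^3 = 1.6875x^4 - 6.75x^3 + 9x^2 - 5x + 1
coeffs = [1000 * K_EQ * c for c in (1.6875, -6.75, 9.0, -5.0, 1.0)]
coeffs[2] -= 1000.0          # the -x^2 term, scaled by 1000

def residual(x):
    """k [N2][H2]^3 - [NH3]^2 with [NH3] = x, [N2] = 1 - x/2, [H2] = 1 - 3x/2."""
    return K_EQ * (1 - x / 2) * (1 - 3 * x / 2) ** 3 - x ** 2

def bisect(f, lo, hi, tol=1e-12):
    """Plain bisection; assumes f(lo) and f(hi) have opposite signs."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(lo) * f(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

# 0 <= x <= 2/3 keeps every concentration nonnegative
x_eq = bisect(residual, 0.0, 2.0 / 3.0)
print(coeffs)
print(round(x_eq, 5))
```

Restricting the search to the interval on which the concentrations make physical sense avoids having to compute and then discard the three inadmissible roots.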


Other Polynomial Rootfinding Schemes 


The Jenkins-Traub method [3,4,5] is widely used in software libraries for polynomial 
rootfinding. This method is best described as a polyalgorithm, combining multi- 
ple schemes to produce a robust and efficient computational procedure. Details 
can be found in Householder [6] and Ralston and Rabinowitz [7]. The Lehmer- 
Schur algorithm generalizes the one-dimensional notion of bracketing and isolates 
roots inside circles in the complex plane. Consult Acton [8] for an introduction 
and Householder [6] for further details. For polynomials with real coefficients, 
Bairstow’s method seeks quadratic factors. This avoids the need for complex arith- 
metic. A completely different approach to polynomial rootfinding is to formulate 
the rootfinding problem as a matrix eigenvalue problem. Here, a matrix is first constructed
whose eigenvalues are the roots of the given polynomial. Then numerical
techniques to locate the eigenvalues of the matrix are applied.

EXERCISES

1. Use synthetic division to deflate the given polynomial by the indicated root.
(a) $p(x) = x^4 - 2.25x^3 - 25.75x^2 + 28.5x + 126$, $x^* = 3$
(b) $p(x) = x^4 + 1.83x^3 - 0.081x^2 + 1.83x - 1.081$, $x^* = -2.3$
(c) $p(x) = x^4 + 20.5x^3 + 129.5x^2 + 230x - 150$, $x^* = 0.5$


2. Apply Laguerre’s method to each of the following polynomials with a starting
approximation of $x_0 = 0$ and a convergence tolerance of $5 \times 10^{-11}$.
(a) $p(x) = x^3 - 4x^2 - 3x + 5$
(b) $p(x) = x^3 - 7x^2 + 14x - 6$
(c) $p(x) = x^4 + 20.5x^3 + 129.5x^2 + 230x - 150$
(d) $p(x) = x^4 - 2x^3 - 5x^2 + 12x - 5$


3. Construct an algorithm to deflate the nth-degree polynomial

$$p(x) = a_n x^n + a_{n-1} x^{n-1} + a_{n-2} x^{n-2} + \cdots + a_1 x + a_0$$

by the quadratic factor $x^2 + \alpha x + \beta$; that is, find the polynomial

$$q(x) = b_{n-2} x^{n-2} + b_{n-3} x^{n-3} + b_{n-4} x^{n-4} + \cdots + b_1 x + b_0$$

such that $p(x) = (x^2 + \alpha x + \beta)q(x)$.

4. Determine all roots for each of the following polynomials. Use a convergence
tolerance of $5 \times 10^{-11}$.
(a) pla) = 20° — 624 4 Be? 4 2? +2 

(b) p(x) = —328 + 29 + 10x - 1 

(c) plz) = 29 + 2° - O24 — 80? + 29e? — 4a + 4 

(d) plz) = 24 +523 +72? 41 

(e) $p(x) = 16x^4 - 40x^3 + 5x^2 + 20x + 6$

(f) $p(x) = 10x^3 - 8.3x^2 + 2.295x - 0.21141$


5. The Chebyshev polynomials, $T_i(x)$, are a special class of functions. They satisfy
the two-term recurrence relation

$$T_{i+1}(x) = 2xT_i(x) - T_{i-1}(x)$$

with $T_0(x) = 1$ and $T_1(x) = x$.
(a) Using the recurrence relation, determine the formula for $T_6(x)$.
(b) Locate all roots of $T_6(x)$.

6. The Hermite polynomials, $H_i(x)$, are a special class of functions. They satisfy
the two-term recurrence relation

$$H_{i+1}(x) = 2xH_i(x) - 2iH_{i-1}(x)$$

with $H_0(x) = 1$ and $H_1(x) = x$.
(a) Using the recurrence relation, determine the formula for $H_5(x)$.
(b) Locate all roots of $H_5(x)$.


7. The Laguerre polynomials, $L_i(x)$, are a special class of functions. They satisfy
the two-term recurrence relation

$$L_{i+1}(x) = (1 + 2i - x)L_i(x) - i^2 L_{i-1}(x)$$

with $L_0(x) = 1$ and $L_1(x) = 1 - x$.
(a) Using the recurrence relation, determine the formula for $L_4(x)$.
(b) Locate all roots of $L_4(x)$.


8. The Legendre polynomials, $P_i(x)$, are a special class of functions. They satisfy
the two-term recurrence relation

$$P_{i+1}(x) = \frac{2i+1}{i+1}\,xP_i(x) - \frac{i}{i+1}\,P_{i-1}(x)$$

with $P_0(x) = 1$ and $P_1(x) = x$.
(a) Using the recurrence relation, determine the formula for $P_5(x)$.
(b) Locate all roots of $P_5(x)$.


9. The concentration, $C$, of a certain chemical in the bloodstream $t$ hours after
injection into muscle tissue is given by

$$C = \frac{3t^2 + t}{60 + t^3}.$$

At what time is the concentration greatest?

10. DeSanti (“A Model for Predicting Aircraft Altitude Loss in a Pull-Up from a
Dive,” SIAM Review, 30 (4), 625-628, 1988) develops the following relationship
for the ratio between the final velocity, $V_f$, and the initial velocity, $V_0$, for an
aircraft executing a pull-up from a dive:


3 
1 Vy Vy 1 
Ln (Rl ee ae eee) 
(%) Go ee 


$\gamma_0$ is the initial flight path angle and $B = g/(kV_0^2)$, where $g$ is the acceleration
due to gravity and $k$ is related to the coefficient of lift. The altitude loss during
the pull-up can be determined from the ratio $V_f/V_0$ using the equation


1 — (Vs/Vo)? 
AY= hE 
Determine the altitude loss associated with each of the following sets of system
parameters (take $g = 9.8$ m/s²):
(a) $V_0 = 100$ m/s, $\gamma_0 = -30°$, $k = 0.00196$ m⁻¹
(b) $V_0 = 150$ m/s, $\gamma_0 = -10°$, $k = 0.00145$ m⁻¹
(c) $V_0 = 200$ m/s, $\gamma_0 = -45°$, $k = 0.00128$ m⁻¹
(d) $V_0 = 250$ m/s, $\gamma_0 = -30°$, $k = 0.00112$ m⁻¹

11. In determining the minimum cushion pressure needed to break a given thickness
of ice using an air cushion vehicle, Muller (“Ice Breaking with an Air Cushion
Vehicle,” in Mathematical Modeling: Classroom Notes in Applied Mathematics,
M. S. Klamkin, editor, SIAM, 1987) derived the equation


2 224 2a\3 
p(l— 6) + (ono? - >) p+ : po & ) =0, 


3r4 3r2 


where $p$ denotes the cushion pressure, $h$ the thickness of the ice field, $r$ the size
of the air cushion, $\sigma$ the tensile strength of the ice, and $\beta$ is related to the width
of the ice wedge. Taking $\beta = 0.5$, $r = 40$ feet, and $\sigma = 150$ pounds per square
inch (psi), determine the cushion pressure needed to break a sheet of ice 6 feet
thick.


12. Determine the roots of the polynomials

$$P(x) = (x - 1)(x - 2)(x - 3)(x - 4)(x - 5)(x - 6)(x - 7)(x - 8)(x - 9)(x - 10)$$

and

P(x) = (w@—1)(w—2)(w—3)(a—4)(a —5)(a —6)(x — 7) (2-8) (a —9)(a— 10) 42° 


with Laguerre’s method as the central rootfinding scheme. Apply a convergence
tolerance of $5 \times 10^{-11}$, and take 0 as the initial approximation.


13. One mole of H2S is injected into a two liter reaction chamber, and the reversible
reaction

$$\mathrm{2H_2S \rightleftharpoons 2H_2 + S_2}$$

is allowed to proceed to equilibrium. If the equilibrium constant for the indicated
reaction is $k = 0.016$, how much H2 and S2 are present at equilibrium?


14. The reversible reaction

$$\mathrm{2SO_2 + O_2 \rightleftharpoons 2SO_3}$$

is allowed to proceed to equilibrium in a one liter reaction chamber. If 0.012
moles of SO2 and 0.0076 moles of O2 are initially present and the equilibrium
constant for the indicated reaction is $k = 44.643$, how much SO3 is present at
equilibrium?


CHAPTER 3 


Systems of Equations 


AN OVERVIEW 
Fundamental Mathematical Problems 


In this chapter we will discuss the general mathematical problems associated with 
the simultaneous solution of systems of n algebraic equations in n unknowns. These 
problems take one of two possible forms: 


Linear Systems of Equations 

Given a nonsingular $n \times n$ matrix $A$ and an $n$-vector $\mathbf{b}$, determine the
n-vector x that satisfies the equation Ax = b. 

Nonlinear Systems of Equations 

Given a function $\mathbf{F} : \mathbb{R}^n \to \mathbb{R}^n$, find an $n$-vector $\mathbf{x}$ such that $\mathbf{F}(\mathbf{x}) = \mathbf{0}$.


Techniques for solving linear systems of equations separate into two classes—direct 
techniques and iterative techniques—whereas nonlinear systems are solved exclu- 
sively with iterative methods. 

Direct techniques produce a solution to the system of equations in a fixed 
number of steps. The solution obtained by these techniques would be exact except 
for roundoff error. Iterative techniques, on the other hand, generate a sequence 
of approximations which converge toward the true solution. The amount of work 
required by these techniques therefore depends on the specific problem being at- 
tacked and on the choice of a starting approximation. When working with iterative 
methods, both the conditions under which and the speed with which the resulting 
sequence converges must be explored. 

Here are a few applications which give rise to a system of algebraic equations. 


Forces in a Plane Truss 


Consider the statically determinate plane truss shown in Figure 3.1. The structure 
is pinned to a stationary support at the upper right and supported by a roller at
the lower left. Furthermore, the structure is subjected to a 3-kN (kilo-Newton) 
horizontal force at the lower right joint and to a 2-kN force acting at a 45° angle to 
the horizontal at the upper left joint. Our objective is to determine the resulting 
forces within the members and the reaction forces at the stationary support and 
the roller. 

Assuming each force acts in the direction indicated in the diagram, balancing 
the horizontal and vertical components at each joint provides eight simultaneous 


Figure 3.1 Forces acting on a statically determinate truss 


linear equations for the eight unknowns. The table below summarizes the equations 
in this system. 


Joint Horizontal Component Vertical Component 
lower left 3F, +P) =0 qf + Fr =0 
lower right F,—-3F;+32F;-3=0 $F3+4F,=0 
upper left $F, -32F+Fy+V2=0 37, +3h-V2=0 
upper right Fy + 2B; - Fx =0 35 —- Fy =0 


Multistage Chemical Extraction 


Figure 3.2 shows a schematic for an n-stage countercurrent chemical extraction
reactor. Water containing a mass fraction \(x_{in}\) of a certain chemical enters at one
end of the reactor, while a solvent containing a mass fraction \(y_{in}\) of the same
chemical enters at the other end. The water stream has a mass flow rate of W,
and the solvent stream has a mass flow rate of S. As the streams move from stage
to stage, the chemical is extracted from the water and transferred to the solvent.
Given values for \(x_{in}\), \(y_{in}\), W, S, and n, find the mass fraction of the chemical
leaving each stage of the reactor in both the water and solvent streams.

A material balance around stage j (for j = 2, 3, 4, ..., n - 1) yields the equation

\[ W x_{j-1} + S y_{j+1} = W x_j + S y_j. \qquad (1) \]

At equilibrium, the linear relationship

\[ y_j = m x_j \qquad (2) \]


is assumed to hold at each stage of the reactor, where m is a constant that depends 
on the chemical being extracted and the solvent. Substituting (2) into (1) and 
[Figure 3.2 stream labels: chemical-rich water, chemical-rich solvent, chemical-lean water, chemical-lean solvent]


Figure 3.2 Schematic of an n-stage countercurrent chemical extrac- 
tion reactor. 


rearranging terms gives

\[ W x_{j-1} - (W + Sm) x_j + Sm\, x_{j+1} = 0. \qquad (3) \]

Performing material balances around stage 1 and stage n produces the equations

\[ -(W + Sm) x_1 + Sm\, x_2 = -W x_{in} \qquad (4) \]

and

\[ W x_{n-1} - (W + Sm) x_n = -S y_{in}, \qquad (5) \]

respectively. Equations (3), (4), and (5) form a complete system of n linear
equations for the \(x_j\), the mass fractions in the water stream. After solving this system,
the mass fractions in the solvent stream, the \(y_j\), are obtained from (2).
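
To make the structure of this system concrete, the following Python sketch assembles equations (3), (4), and (5) as a coefficient matrix and right-hand side. It is only an illustration: the function names `extraction_system` and `residual` and all numerical values are assumptions chosen for the example, not data from the text. The check at the end uses the observation that if the two feed streams are already in equilibrium (\(y_{in} = m x_{in}\)), then \(x_j = x_{in}\) at every stage satisfies all n equations.

```python
def extraction_system(W, S, m, x_in, y_in, n):
    """Assemble equations (3), (4), and (5) as a dense matrix A and vector b."""
    if n < 2:
        raise ValueError("this sketch assumes at least two stages")
    A = [[0.0] * n for _ in range(n)]
    b = [0.0] * n
    for j in range(n):
        A[j][j] = -(W + S * m)      # coefficient of x_j
        if j > 0:
            A[j][j - 1] = W         # coefficient of x_{j-1}
        if j < n - 1:
            A[j][j + 1] = S * m     # coefficient of x_{j+1}
    b[0] = -W * x_in                # stage 1, equation (4)
    b[n - 1] = -S * y_in            # stage n, equation (5)
    return A, b


def residual(A, b, x):
    """Return A x - b, which is zero when x solves the system."""
    return [sum(a * xi for a, xi in zip(row, x)) - bi for row, bi in zip(A, b)]


# Hypothetical data: feed streams already in equilibrium (y_in = m * x_in),
# so every stage should keep the inlet mass fraction x_in.
A, b = extraction_system(W=200.0, S=50.0, m=7.0, x_in=0.075, y_in=0.525, n=5)
print(max(abs(r) for r in residual(A, b, [0.075] * 5)))
```

Only the main diagonal and the two adjacent diagonals of the matrix are nonzero, which is the kind of special structure a solver can exploit.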
Coupled Reversible Chemical Reactions


Consider the coupled pair of reversible chemical reactions


2A + B ⇌ C
A + D ⇌ C,


where A, B, C, and D represent certain chemical compounds. Suppose \(A_0\) moles
of chemical A, \(B_0\) moles of B, and \(D_0\) moles of D are injected into a 1-liter reaction
chamber and the above reactions are allowed to proceed to equilibrium. If \(k_1\) and
\(k_2\) are the equilibrium constants of the first and second reactions, respectively, then
how many moles of C are present at equilibrium?

Let \(c_1\) denote the number of moles of C produced by the first reaction and \(c_2\)
denote the number of moles of C produced by the second reaction. From the
equation for the first reaction, we see that to produce \(c_1\) moles of C, \(c_1\) moles of B
and \(2c_1\) moles of A must have reacted. For the second reaction to produce \(c_2\) moles
of C, \(c_2\) moles of both A and D must have reacted. Therefore, at equilibrium, there
will be \(A_0 - 2c_1 - c_2\) moles of A, \(B_0 - c_1\) moles of B, \(c_1 + c_2\) moles of C, and \(D_0 - c_2\)
moles of D present.

An equilibrium constant measures the ratio of the concentrations of products
to reactants, each raised to the power of their respective coefficients in the chemical
equation. Thus

\[ k_1 = \frac{[C]}{[A]^2 [B]} \quad\text{and}\quad k_2 = \frac{[C]}{[A][D]}, \]

where [ · ] denotes the concentration of the indicated chemical. Since the reaction
chamber has a volume of one liter, it follows that at equilibrium \([A] = A_0 - 2c_1 - c_2\)
moles/liter, \([B] = B_0 - c_1\) moles/liter, \([C] = c_1 + c_2\) moles/liter, and \([D] = D_0 - c_2\)
moles/liter. Substituting these concentrations into the expressions for the two
equilibrium constants yields the system of nonlinear equations

\[ k_1 = \frac{c_1 + c_2}{(A_0 - 2c_1 - c_2)^2 (B_0 - c_1)} \quad\text{and}\quad k_2 = \frac{c_1 + c_2}{(A_0 - 2c_1 - c_2)(D_0 - c_2)}. \]

For given values of the parameters \(A_0\), \(B_0\), \(D_0\), \(k_1\), and \(k_2\), we need to solve this
system for \(c_1\) and \(c_2\).
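
To connect this system with the general form F(x) = 0 stated at the start of the chapter, the sketch below writes the two equilibrium equations as a residual function of \(c_1\) and \(c_2\). Multiplying through by the denominators is a choice made here for illustration, and the function name `equilibrium_residual` and the sample numbers are assumptions, not values from the text.

```python
def equilibrium_residual(c1, c2, A0, B0, D0, k1, k2):
    """Return (f1, f2); both vanish when (c1, c2) satisfies the equilibrium equations."""
    a = A0 - 2.0 * c1 - c2                        # equilibrium concentration [A]
    f1 = k1 * a ** 2 * (B0 - c1) - (c1 + c2)      # first reaction:  2A + B <-> C
    f2 = k2 * a * (D0 - c2) - (c1 + c2)           # second reaction: A + D <-> C
    return f1, f2


# Hypothetical check: with A0 = 4, B0 = D0 = 2 and c1 = c2 = 1 we get
# [A] = 1, [B] = 1, [C] = 2, [D] = 1, so k1 = 2/(1*1) and k2 = 2/(1*1).
print(equilibrium_residual(1.0, 1.0, A0=4.0, B0=2.0, D0=2.0, k1=2.0, k2=2.0))
```

An iterative method for nonlinear systems would drive both components of this residual toward zero, starting from an initial guess for \(c_1\) and \(c_2\).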


Remainder of Chapter 


In Section 1 the basic algorithm that will be used for the solution of systems of linear 
algebraic equations, Gaussian elimination, will be introduced. A careful count of 
the number of operations required by this algorithm will also be presented. To 
reduce the effect of roundoff error on the computed solution, pivoting strategies 
will be considered in the next section. Section 3 then introduces the concepts 
of vector and matrix norms, while Section 4 presents estimates for the error in 
the computed solution. The concept of an LU decomposition and its place in the 
practical implementation of a solution algorithm will be discussed in Section 5. 
Coverage of direct techniques for linear systems concludes with a discussion of
direct factorization techniques in Section 6 and a discussion of solution algorithms 
for matrices of special structure in Section 7. Sections 8 and 9 then present iterative 
techniques for linear systems, with the basic concepts and the classical methods 
covered in Section 8 and the conjugate gradient method discussed in Section 9. 
The solution of nonlinear systems of equations is treated in the final section. 


3.0 LINEAR ALGEBRA REVIEW 


In this section, we will review several definitions and concepts of linear algebra. 
The focus will be on topics needed throughout the remainder of the chapter. For a 
more detailed review of this material, consult a standard linear algebra textbook, 
such as Lay [1], Leon [2], or Shifrin and Adams [3].


Matrices 


A matrix is one of the most important tools in linear algebra. 


Definition. A MATRIX is a rectangular collection of numbers arranged in 
rows and columns. 


A matrix with n rows and m columns is said to be of dimension n x m, which 
is read “n by m.” Standard notation is to use a capital letter, such as A, to denote 
a matrix, and the corresponding lowercase letter with two subscripts, as in \(a_{ij}\), to
denote the elements in the matrix. The first subscript indicates the row, and the 
second indicates the column. For example, 


2 <4 4 
a=| J, 0 : 


is a 2 x 3 matrix, with 


y= 2; O2=~-45 a=), 
a9,=—-6; ae2=0; aog = 8. 


A matrix with one column is called a column vector, while a matrix with a 
single row is called a row vector. Vectors will be denoted by a lowercase letter in 
boldface, such as x. The elements in a vector will be denoted by the same lowercase 
letter, not in boldface, with a single subscript, as in \(x_i\). When we wish to indicate
that a specific vector has n elements, we will refer to the vector as an n-vector. 

A matrix that has the same number of rows as columns is called a square
matrix. Suppose A is an n x n square matrix. The elements \(a_{11}, a_{22}, a_{33}, \ldots, a_{nn}\)
are called the diagonal elements of A. All other elements are referred to as
off-diagonal elements. If all of the off-diagonal elements of the square matrix A are zero
(i.e., \(a_{ij} = 0\) for \(i \ne j\)), then A is called a diagonal matrix. An n x n diagonal matrix
whose diagonal elements are all equal to 1 is called the identity matrix, which is
denoted by \(I_n\). When the context makes the dimension of the matrix clear, the
subscript n is generally omitted and the identity matrix is simply denoted as I.
Operations on Matrices


In order to work with equations involving matrices, we need to define some basic
matrix concepts and operations. The first topic we consider is matrix equality.


Definition. Two n x m matrices A and B are EQUAL if \(a_{ij} = b_{ij}\) for each i
and j.


Note the role that the dimension of the matrices plays in this definition. Thus, 
even though every element in the matrix 


—2 6 
Ala | 
is equal to the corresponding element in the matrix 
-2 6 5 Q 
B=] 1 38 0 -3 |, 
0 12 3 


these two matrices are not equal because they have different dimensions. 
The two most basic algebraic operations involving matrices are matrix addi- 
tion and scalar multiplication. These are defined as follows. 


Definition. The SUM of two n x m matrices A and B is an n x m matrix
C = A + B whose elements are given by \(c_{ij} = a_{ij} + b_{ij}\) for each i and j.


Definition. Let A be an n x m matrix and \(\alpha\) be a real number. The SCALAR
MULTIPLICATION of \(\alpha\) and A is an n x m matrix \(C = \alpha A\) whose elements are
given by \(c_{ij} = \alpha a_{ij}\).


EXAMPLE 3.1 Matrix Addition and Scalar Multiplication 
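Let

\[
A = \begin{bmatrix} 2 & 1 \\ 1 & 1 \\ 2 & 3 \end{bmatrix}
\quad\text{and}\quad
B = \begin{bmatrix} -4 & 1 \\ 3 & -1 \\ -2 & 1 \end{bmatrix}.
\]

Then

\[
2A = \begin{bmatrix} 2(2) & 2(1) \\ 2(1) & 2(1) \\ 2(2) & 2(3) \end{bmatrix}
= \begin{bmatrix} 4 & 2 \\ 2 & 2 \\ 4 & 6 \end{bmatrix}
\quad\text{and}\quad
-3B = \begin{bmatrix} 12 & -3 \\ -9 & 3 \\ 6 & -3 \end{bmatrix}.
\]

Further,

\[
A + B = \begin{bmatrix} 2 + (-4) & 1 + 1 \\ 1 + 3 & 1 + (-1) \\ 2 + (-2) & 3 + 1 \end{bmatrix}
= \begin{bmatrix} -2 & 2 \\ 4 & 0 \\ 0 & 4 \end{bmatrix}
\]

and

\[
2A - 3B = 2A + (-3)B = \begin{bmatrix} 16 & -1 \\ -7 & 5 \\ 10 & 3 \end{bmatrix}.
\]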
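
The two definitions above translate almost directly into code. The following Python sketch stores a matrix as a list of rows; the helper names `mat_add` and `scalar_mul` are illustrative choices, not notation from the text. It reproduces the sums computed in Example 3.1.

```python
def mat_add(A, B):
    """Elementwise sum of two matrices with identical dimensions."""
    if len(A) != len(B) or any(len(ra) != len(rb) for ra, rb in zip(A, B)):
        raise ValueError("matrices must have the same dimensions")
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def scalar_mul(alpha, A):
    """Multiply every element of A by the scalar alpha."""
    return [[alpha * a for a in row] for row in A]


A = [[2, 1], [1, 1], [2, 3]]
B = [[-4, 1], [3, -1], [-2, 1]]

print(mat_add(A, B))                                 # [[-2, 2], [4, 0], [0, 4]]
print(mat_add(scalar_mul(2, A), scalar_mul(-3, B)))  # [[16, -1], [-7, 5], [10, 3]]
```

The dimension check mirrors the role that dimensions play in the definitions: the sum is simply undefined for matrices of different sizes.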
Another important matrix operation is the transpose.


Definition. The TRANSPOSE of an n x m matrix A, denoted by \(A^T\), is an
m x n matrix whose ith row is the ith column of A for each i.


For example, if

\[
A = \begin{bmatrix} 2 & 1 \\ 1 & 1 \\ 2 & 3 \end{bmatrix},
\]

then

\[
A^T = \begin{bmatrix} 2 & 1 & 2 \\ 1 & 1 & 3 \end{bmatrix}.
\]
For the matrix 
4 -1 0 
Besih ah 2. deo 
0 -2 4 


note that 


0 -2 4 


A square matrix A is called symmetric if \(A^T = A\).
The final basic matrix operation that we will discuss is matrix multiplication. 


Definition. Let A be an n x m matrix and B be an m x p matrix. The
PRODUCT of A and B is an n x p matrix C whose elements are defined by

\[ c_{ij} = \sum_{k=1}^{m} a_{ik} b_{kj} \]

for each i and j.

Note that the number of columns in the first matrix must equal the number
of rows in the second matrix for the matrix product to be defined.


EXAMPLE 3.2 Matrix Multiplication 
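Let

\[
A = \begin{bmatrix} 2 & -6 \\ -4 & 0 \\ 1 & 5 \end{bmatrix}, \quad
B = \begin{bmatrix} 1 & 1 \\ 2 & 3 \end{bmatrix}, \quad\text{and}\quad
C = \begin{bmatrix} 3 & -1 \\ -2 & 1 \end{bmatrix}.
\]

Then

\[
AB = \begin{bmatrix}
(2)(1) + (-6)(2) & (2)(1) + (-6)(3) \\
(-4)(1) + (0)(2) & (-4)(1) + (0)(3) \\
(1)(1) + (5)(2) & (1)(1) + (5)(3)
\end{bmatrix}
= \begin{bmatrix} -10 & -16 \\ -4 & -4 \\ 11 & 16 \end{bmatrix}
\]

and

\[
AC = \begin{bmatrix} 18 & -8 \\ -12 & 4 \\ -7 & 4 \end{bmatrix},
\]

but the products BA and CA are not defined because the number of columns in B
and C is not equal to the number of rows in A.
Finally,

\[
BC = CB = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}.
\]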

The Inverse Matrix 


In the last example, we saw two square matrices whose product was the identity 
matrix. Square matrices with this property are called inverses. 


Definition. Let A be an n x n matrix. If there exists an n x n matrix B
such that BA = AB = I, then the matrix B is called the INVERSE of A and
is denoted by \(A^{-1}\).


Not all square matrices have inverses. For example, consider the 2 x 2 matrix

\[
A = \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix}.
\]

If

\[
B = \begin{bmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{bmatrix},
\quad\text{then}\quad
AB = \begin{bmatrix} b_{11} & b_{12} \\ 0 & 0 \end{bmatrix}.
\]

From here, we see that no choice of the elements \(b_{11}\), \(b_{12}\), \(b_{21}\), and \(b_{22}\) will result
in the product AB being equal to I. Hence, A does not have an inverse. Matrices
that do not have an inverse are called singular, whereas matrices that do have an
inverse are called nonsingular.

The following theorem lists several important results regarding inverses. The 
proofs will be considered in the exercises. 


Theorem. Let A be a nonsingular matrix. Then
1. \(A^{-1}\) is unique;
2. \(A^{-1}\) is nonsingular and \((A^{-1})^{-1} = A\);
3. \(A^T\) is nonsingular and \((A^T)^{-1} = (A^{-1})^T\); and
4. If B is nonsingular, then AB is nonsingular and \((AB)^{-1} = B^{-1}A^{-1}\).


The Determinant 


Associated with every square matrix is a real number called the determinant of A, 
which we shall denote by det(A). 
Definition. The DETERMINANT of a square matrix A is defined recursively
as follows.


1. If A is the 1 x 1 matrix \([a_{11}]\), then \(\det(A) = a_{11}\).
2. If A is an n x n matrix with n > 1, then

\[ \det(A) = \sum_{j=1}^{n} a_{ij} (-1)^{i+j} m_{ij} \]

for any choice of the row i, or

\[ \det(A) = \sum_{i=1}^{n} a_{ij} (-1)^{i+j} m_{ij} \]

for any choice of the column j, where \(m_{ij}\) is the determinant of the
(n - 1) x (n - 1) matrix obtained by deleting the ith row and the jth
column from A.


Each \(m_{ij}\) is called a minor of A, and the expression \((-1)^{i+j} m_{ij}\) is called the
cofactor associated with \(a_{ij}\). The procedure for calculating determinants given by
this definition is therefore known as expansion by cofactors.


The second statement in this definition provides 2n different ways to calculate 
a determinant, depending on the row or column chosen for the cofactor expansion. 
However, all expansions will lead to the same numerical value. We can therefore 
use the flexibility of the definition to our advantage by choosing to expand along 
the row or column with the most zero elements. 


EXAMPLE 3.3 Calculating a Determinant 


Consider the 4 x 4 matrix 


tO: “A 
eae ee 
AS [HG 4g) 8 
a or | 


Because the third row contains only one non-zero element, we choose to expand
along the third row. Since \(a_{31} = a_{32} = a_{33} = 0\),

\[
\det(A) = a_{34}(-1)^{3+4} m_{34}
= -2 \det \begin{bmatrix} 1 & 0 & 4 \\ -2 & 1 & -3 \\ 3 & 2 & 1 \end{bmatrix}.
\]

Expanding along the second column of this new matrix, we find

\[
\begin{aligned}
\det \begin{bmatrix} 1 & 0 & 4 \\ -2 & 1 & -3 \\ 3 & 2 & 1 \end{bmatrix}
&= 1(-1)^{2+2} \det \begin{bmatrix} 1 & 4 \\ 3 & 1 \end{bmatrix}
 + 2(-1)^{3+2} \det \begin{bmatrix} 1 & 4 \\ -2 & -3 \end{bmatrix} \\
&= [(1)(1) - (3)(4)] - 2[(1)(-3) - (-2)(4)] \\
&= -11 - 10 = -21,
\end{aligned}
\]

where we expanded along the first column of each of the 2 x 2 matrices. Therefore,
det(A) = -2(-21) = 42.
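
The recursive definition of the determinant can be sketched directly in Python. The version below always expands along the first row and skips zero entries; the function name `det` and the fourth-column entries of the 4 x 4 test matrix are assumptions made for illustration (only its third row and its minor \(m_{34}\) are taken from the example, and the result does not depend on the other fourth-column entries).

```python
def det(A):
    """Determinant by cofactor expansion along the first row."""
    n = len(A)
    if n == 1:
        return A[0][0]
    total = 0
    for j in range(n):
        if A[0][j] == 0:
            continue                      # zero entries contribute nothing
        # Minor: delete the first row and column j.
        minor = [row[:j] + row[j + 1:] for row in A[1:]]
        # With 0-based j, the cofactor sign (-1)^(1 + (j+1)) equals (-1)^j.
        total += (-1) ** j * A[0][j] * det(minor)
    return total


M = [[1, 0, 4], [-2, 1, -3], [3, 2, 1]]      # the 3 x 3 minor from the example
print(det(M))                                 # -21

# Hypothetical fourth column (5, 6, 7); the third row is (0, 0, 0, 2).
A4 = [[1, 0, 4, 5], [-2, 1, -3, 6], [0, 0, 0, 2], [3, 2, 1, 7]]
print(det(A4))                                # 42
```

Cofactor expansion is convenient for small matrices with many zeros, but the number of recursive calls grows factorially with n, so it is not a practical way to evaluate large determinants.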


The following theorem presents several important properties of determinants. 
The proof of this theorem can be found in the linear algebra texts cited in the 
references below. 


Theorem. Let A be an n x n matrix.

1. If A has a row or column consisting only of zero entries, then det(A) = 0;
2. If A has two rows the same or two columns the same, then det(A) = 0;
3. \(\det(A^T) = \det(A)\);
4. If A is nonsingular, then \(\det(A^{-1}) = (\det(A))^{-1}\); and
5. If B is an n x n matrix, then det(AB) = det(A) det(B).


We close our linear algebra review with a theorem which links the concepts 
of nonsingular matrices, determinants and solutions of linear systems of equations. 
This theorem will be extremely important throughout the remainder of this chapter. 
For a proof, consult one of the texts listed below. 


Theorem. For any n x n matrix A, the following statements are equivalent:

1. A is nonsingular;
2. \(\det(A) \ne 0\);
3. The equation Ax = 0 has the unique solution x = 0; and
4. The equation Ax = b has a unique solution for any right-hand side
vector b.
EXERCISES


In Exercises 1-9, compute the indicated matrices given 


2 1 0 4 2 
fel ee -3 -1 51, C=|3 -1/], 
1 3 4 2 -4 
1 -1 4 
and D=/0 2 -2]. 
0 0 8 


If an operation cannot be performed, indicate why not. 


1. (a) \(2A + C^T\) (b) \(C - 3B\)
2. (a) AB (b) AD
3. (a) CA (b) AC
4. (a) BD (b) DB
5. (a) BC (b) CB
6. (a) \(3B - 2D\) (b) \(2D^T + B\)
7. (a) det(D) (b) det(A)
8. (a) \(C^T D\) (b) \(BA^T\)
9. (a) \(-2A^T + 5C\) (b) \(B^T + D\)
10. Let A be a nonsingular matrix.

(a) Show that \(A^{-1}\) is unique.
(b) Show that \(A^{-1}\) is nonsingular and \((A^{-1})^{-1} = A\).
(c) Show that \(A^T\) is nonsingular and \((A^T)^{-1} = (A^{-1})^T\).
(d) If B is nonsingular, show that AB is nonsingular and \((AB)^{-1} = B^{-1}A^{-1}\).

11. Can an n x m matrix with \(n \ne m\) be symmetric? Explain.
12. Recalculate the determinant of the matrix 


i 0 4 ] 
21-3 2 
BN a Age aS 
ae oe ee 


by first expanding along the second column. 


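13. Show that

\[
\det \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix} = a_{11}a_{22} - a_{12}a_{21}.
\]

14. Let

\[
A = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix}.
\]

(a) Show that A is nonsingular provided \(a_{11}a_{22} - a_{12}a_{21} \ne 0\).
(b) If \(a_{11}a_{22} - a_{12}a_{21} \ne 0\), show that

\[
A^{-1} = \frac{1}{a_{11}a_{22} - a_{12}a_{21}} \begin{bmatrix} a_{22} & -a_{12} \\ -a_{21} & a_{11} \end{bmatrix}.
\]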

15. Let D be an n x n diagonal matrix. Show that \(\det(D) = d_{11}d_{22}d_{33} \cdots d_{nn}\).
16. Let \(\alpha\) be a real number and let
2 a@ 0 
=o) Sb ob |i, 
a 


a=($ | and B= 
1 3 


(a) For what value(s) of \(\alpha\) is A singular?
(b) For what value(s) of \(\alpha\) is B singular?


3.1 GAUSSIAN ELIMINATION 


In this chapter we study techniques for the solution of systems of linear algebraic
equations. The most general system of n linear equations in n unknowns can be
written as

\[
\begin{aligned}
a_{11}x_1 + a_{12}x_2 + a_{13}x_3 + \cdots + a_{1n}x_n &= b_1 \\
a_{21}x_1 + a_{22}x_2 + a_{23}x_3 + \cdots + a_{2n}x_n &= b_2 \\
a_{31}x_1 + a_{32}x_2 + a_{33}x_3 + \cdots + a_{3n}x_n &= b_3 \\
&\;\;\vdots \\
a_{n1}x_1 + a_{n2}x_2 + a_{n3}x_3 + \cdots + a_{nn}x_n &= b_n.
\end{aligned}
\]

The \(a_{ij}\) and the \(b_i\) are known constants, and the \(x_j\) are the variables. This system
can be expressed very compactly in matrix notation as Ax = b, where A is the
n x n matrix

\[
A = \begin{bmatrix}
a_{11} & a_{12} & a_{13} & \cdots & a_{1n} \\
a_{21} & a_{22} & a_{23} & \cdots & a_{2n} \\
a_{31} & a_{32} & a_{33} & \cdots & a_{3n} \\
\vdots & \vdots & \vdots & & \vdots \\
a_{n1} & a_{n2} & a_{n3} & \cdots & a_{nn}
\end{bmatrix}
\]

and x and b are the n-dimensional column vectors \([\, x_1 \; x_2 \; x_3 \; \cdots \; x_n \,]^T\)
and \([\, b_1 \; b_2 \; b_3 \; \cdots \; b_n \,]^T\), respectively. A is called the coefficient matrix,
x the solution vector and b the right-hand side vector for the system.

We will focus on the solution technique known as Gaussian elimination with 
back substitution. After a review of the basic algorithm, several examples will be 
presented to demonstrate the technique. Finally, a detailed account of the number 
of operations required to compute the solution will be given, and a comparison with 
other possible solution strategies will be made. 
Reviewing the Basics


The first step in the solution of a linear system of equations is to gather all the
information needed to compute the solution (that is, the coefficients and the
right-hand sides) into one structure, known as the augmented matrix for the system. For
a system of n equations in n unknowns, the augmented matrix will have dimensions
n x (n + 1). The first n columns are the coefficient matrix, A, for the system. The
right-hand side vector, b, forms the last column. For the general system of linear
equations given earlier, the augmented matrix is

\[
\left[\begin{array}{ccccc|c}
a_{11} & a_{12} & a_{13} & \cdots & a_{1n} & b_1 \\
a_{21} & a_{22} & a_{23} & \cdots & a_{2n} & b_2 \\
a_{31} & a_{32} & a_{33} & \cdots & a_{3n} & b_3 \\
\vdots & \vdots & \vdots & & \vdots & \vdots \\
a_{n1} & a_{n2} & a_{n3} & \cdots & a_{nn} & b_n
\end{array}\right].
\]

It is customary to use a vertical line to separate the two portions, coefficient and 
right-hand side, of the augmented matrix. 

The objective of Gaussian elimination is to transform the coefficient portion 
of the augmented matrix into upper triangular form. 


Definition. A matrix U is called UPPER TRIANGULAR if all elements below 
the main diagonal are zero; that is, if \(u_{ij} = 0\) whenever i > j.


In this definition, and throughout our discussions involving matrices, we will 
assume the conventional interpretation for subscripts on matrix elements: The first 
subscript refers to the row and the second refers to the column. 

The transformation of the coefficient portion of the augmented matrix is car- 
ried out through the systematic application of three elementary row operations 
(EROs). The three operations, and the notation we will use to signify each, are 


ERO1: Any two rows can be interchanged. The notation \(R_i \leftrightarrow R_j\)
indicates that row i was interchanged with row j.


ERO2: Any row can be multiplied by a nonzero constant. The notation
\(r_i \leftarrow mR_i\) indicates that row i was multiplied by m.


ERO3: Any multiple of one row can be added to another row. The
notation \(r_i \leftarrow R_i + mR_j\) indicates that m times row j was added
to row i.


The system of equations corresponding to the matrix which results after any se- 
quence of these operations is performed is equivalent to the original system in the 
sense that it has the same solution set. As we will see below, the majority of the 
work in Gaussian elimination consists of repeatedly applying the third operation. 
To illustrate the Gaussian elimination process, consider the system

\[
\begin{aligned}
x_1 + x_2 + x_3 + x_4 &= 1 \\
x_1 + x_2 + 2x_3 + 3x_4 &= 2 \\
-x_1 + 2x_3 + x_4 &= 1 \\
3x_1 + 2x_2 - x_3 &= 1.
\end{aligned}
\]

We begin by placing the pivot in the first row, first column of the augmented matrix.
In the matrices shown below, the location of the pivot is indicated by angled braces,
\(\langle\;\rangle\). The pivot serves as a reference location for organizing subsequent calculations.
The goal is to replace each element below the pivot, within the pivot column, with
a zero. This can be done by performing ERO3 on the rows below the pivot row,
each time adding an appropriate multiple of the pivot row. The required multiple,
m, is determined by the formula

\[ m = -\frac{\text{element to be replaced by zero}}{\text{element in pivot}}. \]

For the problem at hand, the multipliers for the second, third, and fourth rows are
-1, +1, and -3, respectively. The result of carrying out the corresponding row
operations is

\[
\left[\begin{array}{rrrr|r}
\langle 1 \rangle & 1 & 1 & 1 & 1 \\
1 & 1 & 2 & 3 & 2 \\
-1 & 0 & 2 & 1 & 1 \\
3 & 2 & -1 & 0 & 1
\end{array}\right]
\xrightarrow{\substack{r_2 \leftarrow R_2 - R_1 \\ r_3 \leftarrow R_3 + R_1 \\ r_4 \leftarrow R_4 - 3R_1}}
\left[\begin{array}{rrrr|r}
1 & 1 & 1 & 1 & 1 \\
0 & 0 & 1 & 2 & 1 \\
0 & 1 & 3 & 2 & 2 \\
0 & -1 & -4 & -3 & -2
\end{array}\right].
\]


Having completed one elimination pass through the matrix (generating zeros in one
column), the pivot is moved down one row and to the right one column to set up
for the next pass. At this point, we have a slight problem: the pivot element is
zero. This problem can be bypassed by locating a row below the pivot row which
has a nonzero entry in the pivot column. Provided the original coefficient matrix
was nonsingular, it will always be possible to find such a row. The current pivot
row and the selected row are then interchanged. Here, we choose to interchange
rows 2 and 3. Adding the new pivot row to the fourth row completes the second
elimination pass.

\[
\left[\begin{array}{rrrr|r}
1 & 1 & 1 & 1 & 1 \\
0 & \langle 0 \rangle & 1 & 2 & 1 \\
0 & 1 & 3 & 2 & 2 \\
0 & -1 & -4 & -3 & -2
\end{array}\right]
\xrightarrow{R_2 \leftrightarrow R_3}
\left[\begin{array}{rrrr|r}
1 & 1 & 1 & 1 & 1 \\
0 & \langle 1 \rangle & 3 & 2 & 2 \\
0 & 0 & 1 & 2 & 1 \\
0 & -1 & -4 & -3 & -2
\end{array}\right]
\xrightarrow{r_4 \leftarrow R_4 + R_2}
\left[\begin{array}{rrrr|r}
1 & 1 & 1 & 1 & 1 \\
0 & 1 & 3 & 2 & 2 \\
0 & 0 & 1 & 2 & 1 \\
0 & 0 & -1 & -1 & 0
\end{array}\right]
\]
For the third, and in this case final, pass through the matrix, the pivot is moved to
the third row, third column. As a general rule, the number of elimination passes
is always one less than the number of equations. By adding the third row to the
fourth row, the transformation of the coefficient portion of the augmented matrix
to upper triangular form is complete:

\[
\left[\begin{array}{rrrr|r}
1 & 1 & 1 & 1 & 1 \\
0 & 1 & 3 & 2 & 2 \\
0 & 0 & \langle 1 \rangle & 2 & 1 \\
0 & 0 & -1 & -1 & 0
\end{array}\right]
\xrightarrow{r_4 \leftarrow R_4 + R_3}
\left[\begin{array}{rrrr|r}
1 & 1 & 1 & 1 & 1 \\
0 & 1 & 3 & 2 & 2 \\
0 & 0 & 1 & 2 & 1 \\
0 & 0 & 0 & 1 & 1
\end{array}\right].
\]


To obtain the solution to the system, we are now in position to perform back
substitution. The equation corresponding to the bottom row of the transformed
augmented matrix contains just one variable and can be solved directly. Here, we
find \(x_4 = 1\). This value is then substituted into the equation corresponding to the
next to last row to give \(x_3 + 2(1) = 1\), or \(x_3 = -1\). Continuing to work back
up the matrix, the values for \(x_4\) and \(x_3\) are substituted into the second equation,
yielding \(x_2 + 3(-1) + 2(1) = 2\), or \(x_2 = 3\). Finally, from the first equation we find
\(x_1 + 3 - 1 + 1 = 1\), or \(x_1 = -2\). Collecting these four values, the solution vector is

\[ x = [\, -2 \;\; 3 \;\; -1 \;\; 1 \,]^T. \]
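
The elimination passes, the row interchange, and the back substitution walked through above can be collected into a short Python sketch. This is a minimal illustration rather than the text's own implementation: the function name `gauss_solve`, the use of exact `Fraction` arithmetic, and the rule of swapping with the first lower row that has a nonzero entry in the pivot column are assumptions made for the example.

```python
from fractions import Fraction


def gauss_solve(A, b):
    """Solve Ax = b by Gaussian elimination with back substitution."""
    n = len(A)
    # Build the n x (n+1) augmented matrix with exact rational entries.
    M = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(A, b)]

    for p in range(n - 1):                       # one elimination pass per pivot
        if M[p][p] == 0:
            # Zero pivot: interchange with a lower row that has a nonzero entry.
            swap = next((r for r in range(p + 1, n) if M[r][p] != 0), None)
            if swap is None:
                raise ValueError("matrix is singular")
            M[p], M[swap] = M[swap], M[p]
        for r in range(p + 1, n):
            m = -M[r][p] / M[p][p]               # multiplier for this row
            M[r][p] = Fraction(0)
            for c in range(p + 1, n + 1):
                M[r][c] += m * M[p][c]

    if M[n - 1][n - 1] == 0:
        raise ValueError("matrix is singular")

    x = [Fraction(0)] * n
    for r in range(n - 1, -1, -1):               # back substitution
        s = M[r][n]
        for c in range(r + 1, n):
            s -= M[r][c] * x[c]
        x[r] = s / M[r][r]
    return x


A = [[1, 1, 1, 1], [1, 1, 2, 3], [-1, 0, 2, 1], [3, 2, -1, 0]]
b = [1, 2, 1, 1]
print([int(v) for v in gauss_solve(A, b)])       # [-2, 3, -1, 1]
```

Exact fractions are used here only so that the output matches the hand calculation digit for digit; with floating-point numbers the same loops apply, but roundoff error enters the computed solution.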


Application Problem 1: An Electrical Circuit 


Consider the electrical circuit shown in Figure 3.3. We would like to determine the
currents flowing through the different branches. To deal with such circuits, we apply
Kirchhoff’s current equation and Kirchhoff’s voltage equation. The current equation
states that, at any junction, the sum of the current flowing into the junction must
be equal to the sum of the current flowing out from the junction. The voltage
equation states that the sum of the changes in voltage around any closed loop must
be equal to zero. To apply the voltage equation for this problem, we will also need
to use Ohm’s law, which states that the voltage drop across a resistor is equal to
the product of the current and the resistance.

Applying Kirchhoff’s current equation to the junction on the right side of the
circuit yields

\[ I_1 = I_2 + I_3 \quad\text{or}\quad I_1 - I_2 - I_3 = 0. \]

Balancing the currents flowing into and out from the junction between the 1-Ω
resistor and the 6-volt source and the junction at the bottom of the circuit gives
the equations

\[ I_2 - I_4 - I_5 = 0 \]

and

\[ I_3 + I_4 - I_6 = 0, \]

respectively. The equation obtained from the junction on the left side of the circuit
is just the sum of the three previous equations, so we do not include it in the system.
Figure 3.3


To obtain three more equations, we now turn to Kirchhoff’s voltage equation.
Traveling clockwise around the outermost loop of the circuit, we find that the
current \(I_3\) flows through a 2-Ω resistor and the current \(I_6\) flows through a 1-Ω
resistor. This produces a total voltage drop of \(2I_3 + I_6\), which must balance the
7-volt increase produced by the voltage source at the top of the circuit. Hence, we
have

\[ 2I_3 + I_6 = 7. \]

Traveling clockwise around the loop at the top of the circuit and clockwise around
the loop at the lower right yields

\[ I_2 + 2I_5 = 13 \]

and

\[ -I_2 + 2I_3 - 3I_4 = 0, \]

respectively. The negative signs appear in the last equation because a clockwise
loop travels through the 3-Ω and 1-Ω resistors in the opposite direction of the
indicated current.

The augmented matrix corresponding to this system of six equations in the
six unknown currents is

\[
\left[\begin{array}{rrrrrr|r}
1 & -1 & -1 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & -1 & -1 & 0 & 0 \\
0 & 0 & 1 & 1 & 0 & -1 & 0 \\
0 & 0 & 2 & 0 & 0 & 1 & 7 \\
0 & 1 & 0 & 0 & 2 & 0 & 13 \\
0 & -1 & 2 & -3 & 0 & 0 & 0
\end{array}\right].
\]

Following Gaussian elimination, the transformed augmented matrix takes the form

\[
\left[\begin{array}{rrrrrr|r}
1 & -1 & -1 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & -1 & -1 & 0 & 0 \\
0 & 0 & 1 & 1 & 0 & -1 & 0 \\
0 & 0 & 0 & -2 & 0 & 3 & 7 \\
0 & 0 & 0 & 0 & 3 & 1.5 & 16.5 \\
0 & 0 & 0 & 0 & 0 & -6.5 & -15.5
\end{array}\right].
\]

Back substitution now gives the desired currents: \(I_1 = 6.692\), \(I_2 = 4.385\), \(I_3 =
2.308\), \(I_4 = 0.0769\), \(I_5 = 4.308\), and \(I_6 = 2.385\).
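
As a check on the back substitution step, the following Python sketch applies it to the transformed augmented matrix for the circuit. The function name `back_substitute` is an illustrative choice. The triangular entries are those obtained by eliminating the six circuit equations, with the coefficient of \(I_4\) in the third row taken as +1, consistent with the equation \(I_3 + I_4 - I_6 = 0\).

```python
def back_substitute(U, c):
    """Solve Ux = c for an upper triangular U with nonzero diagonal."""
    n = len(U)
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        s = c[row]
        for col in range(row + 1, n):
            s -= U[row][col] * x[col]     # move known terms to the right side
        x[row] = s / U[row][row]
    return x


U = [
    [1, -1, -1,  0,  0,  0],
    [0,  1,  0, -1, -1,  0],
    [0,  0,  1,  1,  0, -1],
    [0,  0,  0, -2,  0,  3],
    [0,  0,  0,  0,  3,  1.5],
    [0,  0,  0,  0,  0, -6.5],
]
c = [0, 0, 0, 7, 16.5, -15.5]

currents = back_substitute(U, c)
print([round(i, 4) for i in currents])
```

The printed values agree with the currents quoted above to the number of digits shown there.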


Application Problem 2: Input-Output Model for a Simple Economy 


Consider a simple economy that consists of four sectors: (1) agriculture; (2) energy; 
(3) manufacturing; and (4) labor. Producing output in one sector generally requires 
input from all four sectors. For example, to produce agricultural output, we likely 
will need seeds and fertilizer from the agriculture sector, farming equipment from 
the manufacturing sector, fuel to run the equipment from the energy sector, and 
farm hands from the labor sector. The relationships between the various sectors 
can be represented by an input-output matrix A, where the element \(a_{ij}\) is defined
as the input required from sector i to produce one unit of output from sector j.

Let the vector x denote the total output of the economy. The vector Ax 
then gives the total input needed from each sector of the economy to produce the 
output x. This quantity is known as the internal demand. If there is additionally 
an external consumer demand on each sector, denoted by the vector d, then the 
total output must be sufficient to cover both the internal and the external demand. 
In other words, x must satisfy 


\[ x = Ax + d \quad\text{or}\quad (I - A)x = d. \]


Suppose that the input-output matrix for our hypothetical four-sector economy
is given by

\[
A = \begin{bmatrix}
0.05 & 0.09 & 0.09 & 0.19 \\
0.16 & 0.15 & 0.28 & 0.21 \\
0.19 & 0.21 & 0.22 & 0.27 \\
0.27 & 0.04 & 0.35 & 0.02
\end{bmatrix}
\]

and that there is consumer demand of $23 billion for agriculture, $45 billion for
energy, $39 billion for manufacturing and $12 billion for labor. The augmented matrix for
determining x is then

\[
\left[\begin{array}{rrrr|r}
0.95 & -0.09 & -0.09 & -0.19 & 23 \\
-0.16 & 0.85 & -0.28 & -0.21 & 45 \\
-0.19 & -0.21 & 0.78 & -0.27 & 39 \\
-0.27 & -0.04 & -0.35 & 0.98 & 12
\end{array}\right].
\]

Applying Gaussian elimination with back substitution, we find that an output of
$64.59 billion from the agriculture sector, $127.27 billion from the energy sector,
$128.02 billion from the manufacturing sector and $80.96 billion from the labor
sector will meet the indicated consumer demand.
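
One quick way to check a computed output vector is to substitute it back into (I - A)x = d. The Python sketch below forms I - A from the input-output matrix and multiplies by the outputs quoted above; the helper names `identity_minus` and `mat_vec` are assumptions for illustration. Because the outputs are rounded to two decimal places, the product should match the demand vector only to within a small rounding error.

```python
def identity_minus(A):
    """Return the matrix I - A for a square matrix A."""
    n = len(A)
    return [[(1.0 if i == j else 0.0) - A[i][j] for j in range(n)] for i in range(n)]


def mat_vec(M, v):
    """Matrix-vector product M v."""
    return [sum(m * vi for m, vi in zip(row, v)) for row in M]


A = [
    [0.05, 0.09, 0.09, 0.19],
    [0.16, 0.15, 0.28, 0.21],
    [0.19, 0.21, 0.22, 0.27],
    [0.27, 0.04, 0.35, 0.02],
]
d = [23.0, 45.0, 39.0, 12.0]             # external demand, billions of dollars
x = [64.59, 127.27, 128.02, 80.96]       # outputs reported in the text

supplied = mat_vec(identity_minus(A), x) # output left over after internal demand
for sector_demand, value in zip(d, supplied):
    print(sector_demand, round(value, 3))
```

Each printed pair should agree to within about a hundredth of a billion dollars, which is the accuracy implied by the two-decimal rounding of x.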


Operation Counts 


How much work does it take to perform Gaussian elimination with back substitution 
to obtain the solution to an arbitrary system of n equations? As a measure of work, 
we will use the number of arithmetic operations being performed. Let’s start with 
Gaussian elimination. The segment of pseudocode given below captures the major 
components of Gaussian elimination. 


    for pass from 1 to n-1
        for row from pass+1 to n
            m = -a[row,pass] / a[pass,pass]
            set a[row,pass] = 0
            for col from pass+1 to n+1
                a[row,col] = a[row,col] + m * a[pass,col]


Note that we set a[row,pass] = 0 to avoid an unnecessary calculation. The innermost
loop has n + 1 as its final index because each row in the augmented matrix also has
the entry from the right-hand side vector.

Traditionally, the number of additions and subtractions have been counted 
separately from the number of multiplications and divisions. On older computers, 
multiplication and division were significantly more time-consuming than addition 
and subtraction. On many modern architectures, however, multiplication is no 
more expensive than either addition or subtraction, and division is not even twice 
as expensive. We will therefore break from tradition and just count the total number 
of arithmetic operations. 

A scan of the pseudocode indicates that two arithmetic operations are per- 
formed each time the innermost loop is executed and one more operation is per- 
formed inside the middle loop. Therefore, the total number of arithmetic operations 
for Gaussian elimination is 


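\[
\begin{aligned}
\sum_{pass=1}^{n-1} \sum_{row=pass+1}^{n} \left[ 1 + \sum_{col=pass+1}^{n+1} 2 \right]
&= \sum_{pass=1}^{n-1} \sum_{row=pass+1}^{n} (2n - 2\,pass + 3) \\
&= \sum_{pass=1}^{n-1} (2n - 2\,pass + 3)(n - pass) \\
&= \frac{2}{3}n^3 + \frac{1}{2}n^2 - \frac{7}{6}n.
\end{aligned}
\]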

To determine the number of operations required by back substitution, examine the
pseudocode


    x[n] = b[n] / a[n,n]
    for row from n-1 to 1 by -1
        sum = b[row]
        for col from row+1 to n
            sum = sum - a[row,col] * x[col]
        x[row] = sum / a[row,row]


One operation is performed before the loops, one inside the outer loop and two
inside the inner loop. The total number of operations for back substitution is then

\[ 1 + \sum_{row=1}^{n-1} \left[ 1 + \sum_{col=row+1}^{n} 2 \right] = n^2. \]

In summary, solving a system of n linear equations in n unknowns by Gaussian
elimination with back substitution requires

\[ \frac{2}{3}n^3 + \frac{3}{2}n^2 - \frac{7}{6}n \]

arithmetic operations.


Why Not Gauss-Jordan Elimination or Multiplication by the Inverse Matrix? 


Gaussian elimination with back substitution is not the only direct procedure avail- 
able for solving linear systems of equations. Two alternative strategies are Gauss- 
Jordan elimination and multiplication by the inverse of the coefficient matrix. An 
examination of the operation counts for these two techniques will make it clear why 
Gaussian elimination is preferred. 

Gauss-Jordan elimination replaces the elements both above and below the
pivot with zeros and generates ones along the main diagonal (by using ERO2),
thereby reducing the coefficient portion of the augmented matrix to the identity
matrix. This process removes the need to perform back substitution and saves \(n^2\)
arithmetic operations. The elimination phase, however, requires more operations.
The number of operations needed to reduce the coefficient portion of the augmented
matrix to the identity matrix is

\[
\sum_{pass=1}^{n-1} \sum_{row=pass+1}^{n} \left[ 1 + \sum_{col=pass+1}^{n+1} 2 \right]
+ \sum_{pass=2}^{n} \sum_{row=1}^{pass-1} \left[ 1 + \sum_{col=pass+1}^{n+1} 2 \right]
+ n = n^3 + n^2 - n.
\]

The two triple sums take into account the generation of zeros below the pivot and
above the pivot, respectively, and the term following the summations corresponds
to the n divisions needed to place ones along the main diagonal. Although the
operation count still has a leading term of \(n^3\), the leading coefficient for
Gauss-Jordan elimination is half again larger than that for Gaussian elimination. For any
n larger than 2, the added cost of Gauss-Jordan elimination outweighs the savings
achieved by not needing to perform back substitution.
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Multiplying by the inverse of the coefficient matrix is even more expensive. 
First, it can be shown (see Exercise 9) that computation of the inverse matrix 
requires 2n3 — 2n? + n total arithmetic operations. The multiplication A~+b then 
requires 2n? — n additional operations, which is nearly twice the work needed to 
perform back substitution. 
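
The Gauss-Jordan count can be verified the same way. The sketch below assumes each row update costs one division for the multiplier plus two operations per entry in columns pass+1 through n+1, and that n final divisions place the ones on the diagonal. The names `gauss_jordan_op_count`, `gaussian_total`, and `inverse_total` are illustrative; the last two simply evaluate the closed-form counts quoted in this section.

```python
def gauss_jordan_op_count(n):
    """Tally operations for Gauss-Jordan elimination on an n x (n+1) augmented matrix."""
    ops = 0
    for p in range(1, n + 1):                  # one pass per pivot column
        for row in range(1, n + 1):
            if row == p:
                continue                       # the pivot row itself is not updated
            ops += 1                           # multiplier: one division
            ops += 2 * (n + 1 - p)             # update columns p+1 .. n+1
    return ops + n                             # n divisions to place ones on the diagonal


def gaussian_total(n):
    # (2/3)n^3 + (3/2)n^2 - (7/6)n, kept in exact integer arithmetic
    return (4 * n**3 + 9 * n**2 - 7 * n) // 6


def inverse_total(n):
    # inverse (2n^3 - 2n^2 + n) followed by the product with b (2n^2 - n)
    return (2 * n**3 - 2 * n**2 + n) + (2 * n**2 - n)


for n in (3, 10, 100):
    print(n, gaussian_total(n), gauss_jordan_op_count(n), inverse_total(n))
```

Adding the two inverse-based counts gives $2n^3$ operations in all, three times the leading term of Gaussian elimination.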


EXERCISES 


In Exercises 1-5, write out the augmented matrix for the indicated linear system of
equations and then obtain the solution using Gaussian elimination with back substitution.

1d. 24, — 22 + 23 = -1 
4y, + 242 + «3 = 4 
6z1 — 442 + 2x3 = 2 

2 ari + 2x9 + 23 = -1 
a) + 22 + 33 = ? 
$21 + 222 + $23 = 

3 a + 2x2 - «3 = |] 
24, - 22 + «3 = 3 
—a, + 222 + 323 = 7 

4 aw + «3 + 2 = 0 
321 + 343 - 444 = 7 
a + «2 + 23 + 2%, = 6 
20, + 302 + 23 + 344 = 6 

5. 321 ~ 22 + 343 + %G = 6 
621 + 973 -— 2% = 13 
—1221 —- j0%3 + Sag = -I17 
72x, —- 82g + 4823 - 19%, = 98 


6. Let U be an $n \times n$ upper triangular matrix. Show that
   $\det(U) = u_{11}u_{22}u_{33}\cdots u_{nn}$.


7. Suppose we had not assigned the value 0 to the element $a_{row,pass}$ in our Gaussian
   elimination pseudocode and had instead computed the value inside the innermost
   loop. How many arithmetic operations would that have added to the operation
   count for the elimination phase?


8. (a) Construct an algorithm to carry out Gauss-Jordan elimination; that is, during
       each pass through the matrix, generate zeros both above and below the
       pivot element; after all n passes, place ones along the diagonal.

   (b) Show that the total number of arithmetic operations needed to solve a
       system of n equations in n unknowns using Gauss-Jordan elimination is
       $n^3 + n^2 - n$.


9. The inverse of an $n \times n$ matrix can be computed by performing Gauss-Jordan
   elimination on an $n \times 2n$ augmented matrix, where the last n columns are the
   $n \times n$ identity matrix.

   (a) Show that if one naively applies Gauss-Jordan elimination without taking
       into account the structure of the identity matrix, then computation of the
       inverse requires $3n^3 - 2n^2$ arithmetic operations.

   (b) Show that if one takes into account the structure of the identity matrix (and
       does not perform multiplication when the matrix element is a one and does
       not perform addition/subtraction when one of the elements is known to be
       zero), then computation of the inverse can be reduced to $2n^3 - 2n^2 + n$
       operations.


10. (a) Solve the system

        $$\begin{aligned}
        3.02x_1 - 1.05x_2 + 2.53x_3 &= -1.61 \\
        4.33x_1 + 0.56x_2 - 1.78x_3 &= 7.23 \\
        -0.83x_1 - 0.54x_2 + 1.47x_3 &= -3.38
        \end{aligned}$$

        using Gaussian elimination with back substitution.

    (b) Change the coefficient of $x_1$ in the first equation to 3.01 and solve the
        resulting system. By what percentage have the three components of the
        solution vector changed?

    (c) Return the coefficient of $x_1$ in the first equation to 3.02, but change the
        right-hand side of the last equation to -3.39 and solve the resulting system. By
        what percentage have the three components of the solution vector changed
        from their values in part (a)?


11. (a) Solve the system


6a, — 2@2 + 8@3 = 5 
ay — da. + 423 = 2 
a + 822 - #3 = 5 


using Gaussian elimination with back substitution.

(b) Change the coefficient of $x_1$ in the first equation to 6.01 and solve the
resulting system. By what percentage have the three components of the
solution vector changed?

(c) Return the coefficient of $x_1$ in the first equation to 6, but change the
right-hand side of the second equation to 1.99 and solve the resulting system. By
what percentage have the three components of the solution vector changed
from their values in part (a)?


12. Let A be the $n \times n$ matrix whose entries are given by $a_{ij} = 1/(i + j - 1)$ for
    $1 \le i, j \le n$.

    (a) For n = 5, 6 and 7, solve the system $A\mathbf{x} = \mathbf{b}$ using single precision
        arithmetic. In each case, take b as the vector that corresponds to an exact
        solution of $x_i = 1$ for each i = 1, 2, 3, ..., n. Calculate the maximum
        component-wise error between the computed solution and the exact solution
        for each n.

    (b) For n = 11, 12 and 13, solve the system $A\mathbf{x} = \mathbf{b}$ using double precision
        arithmetic. In each case, take b as the vector that corresponds to an exact
        solution of $x_i = 1$ for each i = 1, 2, 3, ..., n. Calculate the maximum
        component-wise error between the computed solution and the exact solution
        for each n.


13. The circuit shown in Figure 3.4 could be used as part of a system for charging
    a car battery. Assuming that the internal resistance of the generator and the
    battery are negligible and applying Kirchhoff's loop equation around the left and
    right loops of the circuit (traveling counterclockwise about the left loop and
    clockwise around the right loop) produces the equations

    $$-4I_2 + 15I_3 = 12$$

    and

    $$10I_1 + 15I_3 = 100.$$

    Balancing the current flowing into and out from the junction between the 4-Ω
    and 10-Ω resistors yields the equation $I_1 = I_2 + I_3$. Determine the current flowing
    through each branch of the circuit.

    [Figure 3.4: Figure for Exercise 13. Circuit with a 100 V DC generator and 4-Ω and 10-Ω resistors.]


14. Consider a simple economy that consists of three sectors: food, clothing, and
    shelter. The production of one unit of food requires 0.43 units of food, 0.17 units
    of clothing and 0.18 units of shelter. The production of one unit of clothing
    requires 0.08 units of food, 0.23 units of clothing, and 0.28 units of shelter.
    The production of one unit of shelter requires 0.23 units of food, 0.16 units of
    clothing, and 0.14 units of shelter. If consumer demand is for $90 million worth
    of food, $32 million worth of clothing, and $245 million worth of shelter, what
    total output from each sector is needed?


15. Suppose the coefficient matrix and the control vector for the longitudinal
    dynamics of an aircraft are given by

    $$A = \begin{bmatrix}
    -0.0507 & -3.861 & 0 & -32.17 \\
    -0.00117 & -0.5164 & 1 & 0 \\
    -0.000129 & 1.4168 & -0.4932 & 0 \\
    0 & 0 & 1 & 0
    \end{bmatrix}$$

    and $\mathbf{b} = [\,0 \;\; -0.0717 \;\; -1.645 \;\; 0\,]^T$, respectively. In order to change the
    open loop coefficient vector $\mathbf{a} = [\,1.0603 \;\; -1.115 \;\; -0.0565 \;\; -0.0512\,]^T$ into
    the closed loop coefficient vector $\hat{\mathbf{a}} = [\,2.52 \;\; 6.31 \;\; 0.150 \;\; 0.0625\,]^T$, the gain
    vector, g, in the feedback control law must satisfy the equation

    $$(QW)^T \mathbf{g} = \hat{\mathbf{a}} - \mathbf{a}.$$

    The matrix Q takes the form $Q = [\,\mathbf{b} \;\; A\mathbf{b} \;\; A^2\mathbf{b} \;\; A^3\mathbf{b}\,]$ and

    $$W = \begin{bmatrix}
    1 & 1.0603 & -1.115 & -0.0565 \\
    0 & 1 & 1.0603 & -1.115 \\
    0 & 0 & 1 & 1.0603 \\
    0 & 0 & 0 & 1
    \end{bmatrix}.$$

    Compute g.


16. Solve the system of equations associated with the “Forces in a Plane Truss” 
problem capsule presented in the Chapter 3 Overview (see page 138). 


3.2. PIVOTING STRATEGIES 


It is sometimes necessary during Gaussian elimination to interchange rows while 
solving a system of linear equations so as to avoid a zero pivot element. When 
performing calculations in finite precision arithmetic, it may also be necessary to 
interchange rows to reduce the effect of roundoff error on the computed solution. 
In this section, we will first illustrate the problems which can arise when solving 
linear systems using finite precision arithmetic. We will then discuss two strategies 
which can be used to reduce the effects of roundoff error. 


An Example to Motivate the Discussion 
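Consider the system of three equations in three unknowns

$$\begin{aligned}
\tfrac{2}{3}x_1 + \tfrac{2}{7}x_2 + \tfrac{1}{5}x_3 &= \tfrac{43}{15} \\
\tfrac{1}{3}x_1 + \tfrac{1}{7}x_2 - \tfrac{1}{2}x_3 &= \tfrac{5}{6} \\
\tfrac{1}{5}x_1 - \tfrac{3}{7}x_2 + \tfrac{2}{5}x_3 &= -\tfrac{12}{5}.
\end{aligned}$$

Working in exact arithmetic, the first pass of Gaussian elimination requires that
the elementary row operations $r_2 \leftarrow R_2 - \frac{1}{2}R_1$ and $r_3 \leftarrow R_3 - \frac{3}{10}R_1$ be carried out.
This yields the equivalent system

$$\begin{aligned}
\tfrac{2}{3}x_1 + \tfrac{2}{7}x_2 + \tfrac{1}{5}x_3 &= \tfrac{43}{15} \\
-\tfrac{3}{5}x_3 &= -\tfrac{3}{5} \\
-\tfrac{18}{35}x_2 + \tfrac{17}{50}x_3 &= -\tfrac{163}{50}.
\end{aligned}$$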


Interchanging the second and third equations produces an upper triangular system,
from which the exact solution $x_1 = 1$, $x_2 = 7$, and $x_3 = 1$ is obtained by back
substitution.

What happens when we try to solve this system using four decimal digit
rounding arithmetic? First, we replace each of the rational numbers in the original
system by its floating point equivalent. This transforms the original system to

$$\begin{aligned}
0.6667x_1 + 0.2857x_2 + 0.2000x_3 &= 2.867 \\
0.3333x_1 + 0.1429x_2 - 0.5000x_3 &= 0.8333 \\
0.2000x_1 - 0.4286x_2 + 0.4000x_3 &= -2.400.
\end{aligned}$$

The first pass of Gaussian elimination produces

$$\begin{aligned}
0.6667x_1 + 0.2857x_2 + 0.2000x_3 &= 2.867 \\
0.0001x_2 - 0.6000x_3 &= -0.5997 \\
-0.5143x_2 + 0.3400x_3 &= -3.260.
\end{aligned}$$


The coefficient on $x_2$ in the second equation should have been zero, but cancellation
error has left us with a small nonzero coefficient, which becomes the pivot for the
second elimination pass. This second pass then produces the upper triangular
system

$$\begin{aligned}
0.6667x_1 + 0.2857x_2 + 0.2000x_3 &= 2.867 \\
0.0001x_2 - 0.6000x_3 &= -0.5997 \\
-3086x_3 &= -3087.
\end{aligned}$$

Turning to back substitution, we find $x_3 = (-3087)/(-3086) = 1.000$. Substituting
this value into the second equation and solving for $x_2$ yields

$$x_2 = \frac{-0.5997 - (1.000)(-0.6000)}{0.0001} = \frac{0.0003}{0.0001} = 3.000.$$

Note the cancellation error in the calculation of the numerator. In fact, three of
the four significant digits have been lost. This cancellation error is then magnified
when we divide by the small pivot element, producing a value which is in error by
more than 57%. Substituting $x_3$ and $x_2$ into the first equation leads to $x_1 = 2.715$,
which is in error by more than 170%.

Why is the computed solution so inaccurate? We can ultimately trace the
problem back to the cancellation error introduced in the first elimination pass.
This generated the small pivot element, 0.0001, for the second pass. Additional
cancellation error was introduced during back substitution, and this error was then
magnified when we divided by that same small pivot element. Now, during Gaussian
elimination, there is always going to be the possibility of cancellation error when
the third elementary row operation is applied. There isn't much we can do about
this. However, we can and, as this example clearly demonstrates, should avoid
using small pivot elements.


Partial Pivoting 


To avoid small pivot elements, we can employ a pivoting strategy. In general, a
pivoting strategy is any systematic scheme for interchanging the rows (and possibly
the columns) of the coefficient matrix to place a selected element in the pivot
position. The simplest such scheme is known as partial pivoting.

PARTIAL PIVOTING  During the ith elimination pass of Gaussian elimination, let

$$M_i = \max_{i \le j \le n} |a_{j,i}|$$

and let $j_0$ be the smallest value of j for which this maximum occurs. If
$j_0 > i$, then interchange rows i and $j_0$.


In other words, we find the element in the pivot column, starting from the
ith row and continuing to the bottom of the matrix, which is of largest magnitude,
and then make that element the pivot element.


EXAMPLE 3.4 Partial Pivoting in Action


Reconsider the system from above, whose representation in a four decimal digit
floating point system with rounding was

$$\begin{aligned}
0.6667x_1 + 0.2857x_2 + 0.2000x_3 &= 2.867 \\
0.3333x_1 + 0.1429x_2 - 0.5000x_3 &= 0.8333 \\
0.2000x_1 - 0.4286x_2 + 0.4000x_3 &= -2.400.
\end{aligned}$$

The first pass of Gaussian elimination proceeds exactly as before because the element
of largest magnitude in the first column (i = 1) is initially in the first equation ($j_0 = 1$). Since
$j_0 = i$, no interchange of equations is required. Hence, the second pass starts from

$$\begin{aligned}
0.6667x_1 + 0.2857x_2 + 0.2000x_3 &= 2.867 \\
0.0001x_2 - 0.6000x_3 &= -0.5997 \\
-0.5143x_2 + 0.3400x_3 &= -3.260.
\end{aligned}$$

Here, note that the element of largest magnitude in the second column (i = 2) is located in the
third equation ($j_0 = 3$). The partial pivoting strategy therefore requires that the
second and third equations be interchanged. This yields

$$\begin{aligned}
0.6667x_1 + 0.2857x_2 + 0.2000x_3 &= 2.867 \\
-0.5143x_2 + 0.3400x_3 &= -3.260 \\
0.0001x_2 - 0.6000x_3 &= -0.5997,
\end{aligned}$$

from which the elimination of $x_2$ from the last equation leaves us with

$$\begin{aligned}
0.6667x_1 + 0.2857x_2 + 0.2000x_3 &= 2.867 \\
-0.5143x_2 + 0.3400x_3 &= -3.260 \\
-0.5999x_3 &= -0.6003.
\end{aligned}$$

Back substitution produces the solution $x_3 = 1.001$, $x_2 = 7.000$, and $x_1 = 1.000$.
To four digits, the values of $x_1$ and $x_2$ are exact, while the value of $x_3$ is in error
by only one-tenth of one percent.
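
The effect of partial pivoting on this system can be reproduced with a short experiment. The sketch below is not the text's algorithm verbatim: it simulates four decimal digit rounding arithmetic with Python's `decimal` module (every arithmetic result is rounded to four significant digits), and the function name `solve_rounded` and its `pivot` flag are illustrative choices. Rows are interchanged explicitly, as in the example.

```python
from decimal import Decimal, localcontext, ROUND_HALF_UP


def solve_rounded(A, b, digits=4, pivot=False):
    """Gaussian elimination with every operation rounded to `digits` significant digits.

    A and b hold decimal strings. With pivot=True, partial pivoting is applied
    and rows are interchanged explicitly.
    """
    with localcontext() as ctx:
        ctx.prec = digits
        ctx.rounding = ROUND_HALF_UP
        A = [[Decimal(v) for v in row] for row in A]
        b = [Decimal(v) for v in b]
        n = len(b)
        for i in range(n - 1):
            if pivot:
                # first row at or below i with the largest pivot-column magnitude
                j0 = max(range(i, n), key=lambda j: abs(A[j][i]))
                A[i], A[j0] = A[j0], A[i]
                b[i], b[j0] = b[j0], b[i]
            for row in range(i + 1, n):
                m = A[row][i] / A[i][i]        # multiplier
                A[row][i] = Decimal(0)
                for col in range(i + 1, n):
                    A[row][col] -= m * A[i][col]
                b[row] -= m * b[i]
        x = [Decimal(0)] * n
        for row in range(n - 1, -1, -1):       # back substitution
            total = b[row]
            for col in range(row + 1, n):
                total -= A[row][col] * x[col]
            x[row] = total / A[row][row]
        return x


A = [["0.6667", "0.2857", "0.2000"],
     ["0.3333", "0.1429", "-0.5000"],
     ["0.2000", "-0.4286", "0.4000"]]
b = ["2.867", "0.8333", "-2.400"]

print([str(v) for v in solve_rounded(A, b)])              # no pivoting
print([str(v) for v in solve_rounded(A, b, pivot=True)])  # partial pivoting
```

Without pivoting the computed values agree with the inaccurate solution found earlier ($x_1 = 2.715$, $x_2 = 3.000$, $x_3 = 1.000$); with pivoting they agree with the values obtained in this example.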
In the preceding example, the necessary row interchange was carried out
explicitly so as not to draw attention away from the action of the pivoting strategy.
In practice, it is usually more efficient to handle row interchanges implicitly. This
is accomplished by maintaining a vector of n elements, such that the ith element
of the vector indicates the row within the matrix that contains the coefficients for
the ith equation. Let's denote this row vector by r. The vector is initialized to

$$\mathbf{r} = [\,1 \;\; 2 \;\; 3 \;\; \cdots \;\; n\,]^T.$$

With this vector, each time a row interchange is required, we need only swap the
corresponding elements of the vector.

Handling row interchanges in an implicit manner does require one important
change to our Gaussian elimination and back substitution algorithms. Every
reference to a row of the coefficient matrix or to a row of the right-hand side vector must
be made through the row vector r. For instance, the coefficient of the fifth variable
in the seventh equation must be accessed as $a_{r_7,5}$, while the right-hand side of the
third equation must be accessed as $b_{r_3}$.
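
The bookkeeping described above can be sketched in a few lines. This is a minimal illustration in ordinary double precision, not the text's pseudocode: the name `solve_partial_pivoting` is hypothetical, indices are 0-based (so `r[i]` plays the role of $r_{i+1}$), and the matrix is assumed to be nonsingular. No row of `A` or `b` is ever moved; only entries of `r` are swapped.

```python
def solve_partial_pivoting(A, b):
    """Gaussian elimination with partial pivoting; interchanges happen only in r."""
    n = len(b)
    A = [[float(v) for v in row] for row in A]   # work on copies
    b = [float(v) for v in b]
    r = list(range(n))                           # r[i] = storage row of equation i
    for i in range(n - 1):
        # smallest j >= i whose pivot-column entry has the largest magnitude
        j0 = max(range(i, n), key=lambda j: abs(A[r[j]][i]))
        if A[r[j0]][i] == 0.0:
            raise ValueError("matrix is singular")
        r[i], r[j0] = r[j0], r[i]                # implicit row interchange
        for j in range(i + 1, n):
            m = A[r[j]][i] / A[r[i]][i]
            A[r[j]][i] = 0.0
            for col in range(i + 1, n):
                A[r[j]][col] -= m * A[r[i]][col]
            b[r[j]] -= m * b[r[i]]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):               # back substitution through r
        total = b[r[i]]
        for col in range(i + 1, n):
            total -= A[r[i]][col] * x[col]
        x[i] = total / A[r[i]][i]
    return x, r


A = [[3, 1, 4, -1], [2, -2, -1, 2], [5, 7, 14, -8], [1, 3, 2, 4]]
b = [7, 1, 20, -4]
x, r = solve_partial_pivoting(A, b)
print(x, [k + 1 for k in r])   # row vector shown with 1-based entries
```

The system used here is the four-equation system of Example 3.5; because `max` returns the first maximal element, ties are broken in favor of the smallest j, as the definition of partial pivoting requires.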


EXAMPLE 3.5 A System with Four Equations


Consider the system whose augmented matrix is

$$\left[\begin{array}{rrrr|r}
3 & 1 & 4 & -1 & 7 \\
2 & -2 & -1 & 2 & 1 \\
5 & 7 & 14 & -8 & 20 \\
1 & 3 & 2 & 4 & -4
\end{array}\right].$$

The exact solution of this system is

$$\mathbf{x} = [\,1 \;\; -1 \;\; 1 \;\; -1\,]^T.$$

Solving this system using Gaussian elimination without pivoting in four decimal
digit rounding arithmetic produces the solution

$$\mathbf{x} = [\,1.131 \;\; -0.7928 \;\; 0.8500 \;\; -0.9987\,]^T$$

(see Exercise 13).

Initialize the row vector to

$$\mathbf{r} = [\,1 \;\; 2 \;\; 3 \;\; 4\,]^T.$$

To determine the location of the pivot, examine the values

$$|a_{r_1,1}| = 3, \quad |a_{r_2,1}| = 2, \quad |a_{r_3,1}| = 5, \quad \text{and} \quad |a_{r_4,1}| = 1.$$

The largest value in this list corresponds to row $r_3$, so $j_0 = 3$. Since $j_0 = 3 > 1 = i$,
we need to swap the first and third elements in the row vector. Thus, for the first
pass, we have

$$\mathbf{r} = [\,3 \;\; 2 \;\; 1 \;\; 4\,]^T.$$

After the elimination pass, the matrix becomes

$$\left[\begin{array}{rrrr|r}
0 & -3.200 & -4.400 & 3.800 & -5.000 \\
0 & -4.800 & -6.600 & 5.200 & -7.000 \\
5.000 & 7.000 & 14.00 & -8.000 & 20.00 \\
0 & 1.600 & -0.8000 & 5.600 & -8.000
\end{array}\right].$$

To determine the location of the next pivot, examine the values

$$|a_{r_2,2}| = 4.800, \quad |a_{r_3,2}| = 3.200, \quad \text{and} \quad |a_{r_4,2}| = 1.600.$$

The largest value in this list corresponds to row $r_2$, so $j_0 = 2$. Since $j_0 = 2 = i$, no
row interchange is needed for the second pass.

The second elimination pass produces the matrix

$$\left[\begin{array}{rrrr|r}
0 & 0 & 0 & 0.3330 & -0.3330 \\
0 & -4.800 & -6.600 & 5.200 & -7.000 \\
5.000 & 7.000 & 14.00 & -8.000 & 20.00 \\
0 & 0 & -3.000 & 7.333 & -10.33
\end{array}\right].$$

The location of the final pivot is determined by examining the values

$$|a_{r_3,3}| = 0.0000 \quad \text{and} \quad |a_{r_4,3}| = 3.000.$$

The largest value here corresponds to row $r_4$, so $j_0 = 4$. Since $j_0 = 4 > 3 = i$, we
need to swap the third and fourth elements in the row vector, giving

$$\mathbf{r} = [\,3 \;\; 2 \;\; 4 \;\; 1\,]^T.$$

Since the element in the $a_{r_4,3} = a_{1,3}$ position is already zero, the third
elimination pass makes no changes to the matrix. The final contents of the matrix and
the row vector are therefore as shown above. Back substitution now yields

$$\begin{aligned}
x_4 &= \frac{b_{r_4}}{a_{r_4,4}} = \frac{b_1}{a_{1,4}} = \frac{-0.3330}{0.3330} = -1.000 \\
x_3 &= \frac{b_{r_3} - a_{r_3,4}x_4}{a_{r_3,3}} = \frac{b_4 - a_{4,4}x_4}{a_{4,3}} = 0.9990 \\
x_2 &= \frac{b_{r_2} - a_{r_2,3}x_3 - a_{r_2,4}x_4}{a_{r_2,2}} = \frac{b_2 - a_{2,3}x_3 - a_{2,4}x_4}{a_{2,2}} = -0.9985 \\
x_1 &= \frac{b_{r_1} - a_{r_1,2}x_2 - a_{r_1,3}x_3 - a_{r_1,4}x_4}{a_{r_1,1}} = 1.000.
\end{aligned}$$

The maximum error in any component of this solution is only 0.15%, which is a
dramatic improvement over the solution obtained without pivoting.
Scaled Partial Pivoting


Partial pivoting works well in many instances but does not reduce the effects of
roundoff error for all problems. Consider the system

$$\begin{aligned}
0.7x_1 + 1725x_2 &= 1739 \\
0.4352x_1 - 5.433x_2 &= 3.271,
\end{aligned}$$

whose exact solution is $x_1 = 20$ and $x_2 = 1$. If we were to solve this system using
four decimal digit rounding arithmetic with partial pivoting, we would leave the
equations in the order listed since the coefficient on $x_1$ in the first equation (0.7)
is larger than the coefficient on $x_1$ in the second equation (0.4352). Eliminating $x_1$
from the second equation, we obtain the equivalent system

$$\begin{aligned}
0.7x_1 + 1725x_2 &= 1739 \\
-1077x_2 &= -1078.
\end{aligned}$$

Back substitution produces the solution $x_2 = 1.001$ and $x_1 = 17.14$. The value of
$x_2$ is in excellent agreement with the exact value, but the value of $x_1$ is in error by
more than 14%.

Partial pivoting performed poorly on this system because it did not take into
account the sizes of the coefficients in locations other than the pivot column when
selecting the pivot. In this case, although 0.7 is larger than 0.4352, when measured
relative to the other coefficients in each equation, 0.7 is actually smaller than 0.4352;
that is,

$$\frac{0.7}{1725} < \frac{0.4352}{5.433},$$

where 1725 and 5.433 are the absolute values of the coefficients of greatest
magnitude in the first and second equations, respectively. Had we decided to choose
the element in the pivot column which is largest in magnitude relative to the other
coefficients in its equation, then we would have switched the order of the equations
in this system prior to eliminating variables:

$$\begin{aligned}
0.4352x_1 - 5.433x_2 &= 3.271 \\
0.7x_1 + 1725x_2 &= 1739.
\end{aligned}$$

Now eliminating $x_1$ from the second equation yields the equivalent system

$$\begin{aligned}
0.4352x_1 - 5.433x_2 &= 3.271 \\
1734x_2 &= 1734.
\end{aligned}$$

Back substitution from this set of equations yields $x_2 = 1.000$ and $x_1 = 20.00$,
which are exact to four digits.
The pivoting strategy we have just applied is known as scaled partial pivoting.


SCALED PARTIAL PIVOTING  Before starting Gaussian elimination, construct
a scale vector s as follows. For each $1 \le i \le n$, let

$$s_i = \max_{1 \le j \le n} |a_{i,j}|.$$

Also, initialize the row vector to

$$\mathbf{r} = [\,1 \;\; 2 \;\; 3 \;\; \cdots \;\; n\,]^T.$$

During the ith elimination pass, let

$$M_i = \max_{i \le j \le n} \frac{|a_{r_j,i}|}{s_{r_j}}$$

and let $j_0$ be the smallest value of j for which this maximum occurs. If
$j_0 > i$, then interchange rows i and $j_0$.


Note that while the row vector will generally change from pass to pass, the 
scale vector is set at the beginning of the process and is not changed thereafter. We 
could modify the scale vector after each pass if we wanted, but for the majority of 
linear systems the added calculations produce negligible benefit. 
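
A small experiment makes the difference between the two strategies concrete. The sketch below simulates four decimal digit rounding arithmetic with Python's `decimal` module and handles interchanges implicitly through a row vector. The function name `solve_scaled_pivoting` and its `scaled` flag are illustrative; with `scaled=False` every scale factor is taken to be 1, which reduces the rule to ordinary partial pivoting. Indices are 0-based and the matrix is assumed to be nonsingular.

```python
from decimal import Decimal, localcontext, ROUND_HALF_UP


def solve_scaled_pivoting(A, b, digits=4, scaled=True):
    """Gaussian elimination in rounded arithmetic with (scaled) partial pivoting."""
    with localcontext() as ctx:
        ctx.prec = digits
        ctx.rounding = ROUND_HALF_UP
        A = [[Decimal(v) for v in row] for row in A]
        b = [Decimal(v) for v in b]
        n = len(b)
        # scale vector: largest magnitude in each row, fixed before elimination starts
        s = [max(abs(v) for v in row) if scaled else Decimal(1) for row in A]
        r = list(range(n))                       # row vector
        for i in range(n - 1):
            j0 = max(range(i, n), key=lambda j: abs(A[r[j]][i]) / s[r[j]])
            r[i], r[j0] = r[j0], r[i]            # implicit interchange
            for j in range(i + 1, n):
                m = A[r[j]][i] / A[r[i]][i]
                A[r[j]][i] = Decimal(0)
                for col in range(i + 1, n):
                    A[r[j]][col] -= m * A[r[i]][col]
                b[r[j]] -= m * b[r[i]]
        x = [Decimal(0)] * n
        for i in range(n - 1, -1, -1):
            total = b[r[i]]
            for col in range(i + 1, n):
                total -= A[r[i]][col] * x[col]
            x[i] = total / A[r[i]][i]
        return x


A = [["0.7", "1725"], ["0.4352", "-5.433"]]
b = ["1739", "3.271"]
print([str(v) for v in solve_scaled_pivoting(A, b, scaled=False)])  # partial pivoting
print([str(v) for v in solve_scaled_pivoting(A, b)])                # scaled partial pivoting
```

For the two-equation system above, the unscaled run reproduces $x_1 = 17.14$, $x_2 = 1.001$, while the scaled run recovers $x_1 = 20$ and $x_2 = 1$.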


EXAMPLE 3.6 Scaled Partial Pivoting in Action


Reconsider the system whose augmented matrix is

$$\left[\begin{array}{rrrr|r}
3 & 1 & 4 & -1 & 7 \\
2 & -2 & -1 & 2 & 1 \\
5 & 7 & 14 & -8 & 20 \\
1 & 3 & 2 & 4 & -4
\end{array}\right]$$

and whose exact solution is

$$\mathbf{x} = [\,1 \;\; -1 \;\; 1 \;\; -1\,]^T.$$

Earlier, we noted dramatic improvement in the computed solution of this system
when using partial pivoting. Let's now examine the effect of using scaled partial
pivoting.

The first step is to construct the scale vector. Since

$$\max_{1 \le j \le 4}|a_{1,j}| = 4, \quad \max_{1 \le j \le 4}|a_{2,j}| = 2, \quad \max_{1 \le j \le 4}|a_{3,j}| = 14, \quad \text{and} \quad \max_{1 \le j \le 4}|a_{4,j}| = 4,$$

we find

$$\mathbf{s} = [\,4 \;\; 2 \;\; 14 \;\; 4\,]^T.$$

Next, we initialize the row vector to

$$\mathbf{r} = [\,1 \;\; 2 \;\; 3 \;\; 4\,]^T.$$

To determine the location of the pivot, we examine the values

$$\frac{|a_{r_1,1}|}{s_{r_1}} = \frac{3}{4}, \quad \frac{|a_{r_2,1}|}{s_{r_2}} = \frac{2}{2}, \quad \frac{|a_{r_3,1}|}{s_{r_3}} = \frac{5}{14}, \quad \text{and} \quad \frac{|a_{r_4,1}|}{s_{r_4}} = \frac{1}{4}.$$

The largest value in this list corresponds to row $r_2$, so $j_0 = 2$. Since $j_0 = 2 > 1 = i$,
we need to swap the first and second elements in the row vector. Thus, for the first
elimination pass, we have

$$\mathbf{r} = [\,2 \;\; 1 \;\; 3 \;\; 4\,]^T.$$

After the elimination pass, the matrix becomes

$$\left[\begin{array}{rrrr|r}
0 & 4.000 & 5.500 & -4.000 & 5.500 \\
2.000 & -2.000 & -1.000 & 2.000 & 1.000 \\
0 & 12.00 & 16.50 & -13.00 & 17.50 \\
0 & 4.000 & 2.500 & 3.000 & -4.500
\end{array}\right].$$

To determine the location of the next pivot, examine the values

$$\frac{|a_{r_2,2}|}{s_{r_2}} = \frac{4.000}{4}, \quad \frac{|a_{r_3,2}|}{s_{r_3}} = \frac{12.00}{14}, \quad \text{and} \quad \frac{|a_{r_4,2}|}{s_{r_4}} = \frac{4.000}{4}.$$

The largest value in this list is 1, which occurs for both row $r_2$ and row $r_4$. Choosing
the first occurrence of the maximum value, we have $j_0 = 2 = i$. Hence, no row
interchange is needed for the second pass.

The second elimination pass produces the matrix

$$\left[\begin{array}{rrrr|r}
0 & 4.000 & 5.500 & -4.000 & 5.500 \\
2.000 & -2.000 & -1.000 & 2.000 & 1.000 \\
0 & 0 & 0 & -1.000 & 1.000 \\
0 & 0 & -3.000 & 7.000 & -10.00
\end{array}\right].$$

The location of the final pivot is determined by examining the values

$$\frac{|a_{r_3,3}|}{s_{r_3}} = \frac{0}{14} \quad \text{and} \quad \frac{|a_{r_4,3}|}{s_{r_4}} = \frac{3.000}{4}.$$

The largest value here corresponds to row $r_4$, so $j_0 = 4 > 3 = i$. We therefore need
to swap the third and fourth elements in the row vector, giving

$$\mathbf{r} = [\,2 \;\; 1 \;\; 4 \;\; 3\,]^T.$$

Since the element in the $a_{r_4,3} = a_{3,3}$ position is already zero, the third
elimination pass makes no changes to the matrix. Back substitution now yields

$$\begin{aligned}
x_4 &= \frac{b_{r_4}}{a_{r_4,4}} = \frac{b_3}{a_{3,4}} = \frac{1.000}{-1.000} = -1.000 \\
x_3 &= \frac{b_{r_3} - a_{r_3,4}x_4}{a_{r_3,3}} = \frac{b_4 - a_{4,4}x_4}{a_{4,3}} = 1.000 \\
x_2 &= \frac{b_{r_2} - a_{r_2,3}x_3 - a_{r_2,4}x_4}{a_{r_2,2}} = \frac{b_1 - a_{1,3}x_3 - a_{1,4}x_4}{a_{1,2}} = -1.000 \\
x_1 &= \frac{b_{r_1} - a_{r_1,2}x_2 - a_{r_1,3}x_3 - a_{r_1,4}x_4}{a_{r_1,1}} = 1.000.
\end{aligned}$$

This solution is exact to four decimal digits.
EXERCISES


1. For each of the following augmented matrices, identify the entry that would serve
as the first pivot element for
(i) Gaussian elimination with no pivoting;
(ii) Gaussian elimination with partial pivoting; and
(iii) Gaussian elimination with scaled partial pivoting.


011 14/0 
77009 zs 7 
@ } 444 6 
2 i 1 6 
ee 6 
G@ <9 | 19 
m1 7 10 °°} | 17 
-8 48 -19| 93 
~1.78 4.33 | 7.23 
(c) | 2.53 -1.05 3.02 | -1.61 
147 —0.54 —0.83 | —3.38 
0.25 0.35 0.15 | 0.60 
(d) | 0.20 0.20 0.25 | 0.90 
0.15 0.20 0.25 | 0.70 


0.2115 2.296 §=2.715 3.215 | 8.438 
(e) 0.4371 3.916 1.683 2.852 | 8.888 
6.099 4.324 23,20 1.578 | 35.20 
4.623 0.8926 15.32 5.305 | 26.14 


For the augmented matrices indicated in Exercises 2-6, show the contents of the matrix
after one pass of
(i) Gaussian elimination with no pivoting;
(ii) Gaussian elimination with partial pivoting; and
(iii) Gaussian elimination with scaled partial pivoting.
For (ii) and (iii), show the contents of the row vector, and for (iii), show the contents
of the scale vector.

2. The augmented matrix from Exercise 1(a).
3. The augmented matrix from Exercise 1(b).
4. The augmented matrix from Exercise 1(c).
5. The augmented matrix from Exercise 1(d).
6. The augmented matrix from Exercise 1(e).


In Exercises 7-12: 
(a) Solve the indicated system using Gaussian elimination with partial pivoting. 
Show all intermediate matrices and the row vector at each step. 
(b) Repeat part (a) using Gaussian elimination with scaled partial pivoting. Show 
the contents of the scale vector. 


7. 24, + 329 
4m, + 29 
34. + 422 

8. 2%, - 2 
4x, + 2xe 
621 = 4x. 

9. 322 
r+ «(xe 
22, + 5ae 

10. Ly + 829 
-—32, -— 422 
221 + 429 

11. Ly — 3x2 
24, + 4x2 
—32, + Tx2 

12. 2) - 229 
zy. + 522 
321 + 22 
24, + 322 


tlt tte tet 


4s 


l++ 


41+ 


£3 
4x3 
623 
x3 
z3 
223 
23 
223 
423 


6x3 
923 
623 


723 

323 

223 
r3 
7x23 
5x3 
523 


13. Show that when the system 


3 
2 
5 
1 


14 
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—5 

2 

1 

17 

—1 7 
2 1 
—8 

4 —4 


20 


is solved using Gaussian elimination with no pivoting and four decimal digit
rounding arithmetic, the resulting solution is

$$\mathbf{x} = [\,1.131 \;\; -0.7928 \;\; 0.8500 \;\; -0.9987\,]^T.$$


In Exercises 14-17, solve the given system in the indicated finite precision arithmetic
using


(i) Gaussian elimination with no pivoting; 
(ii) Gaussian elimination with partial pivoting; and 
(iii) Gaussian elimination with scaled partial pivoting. 


Compare the results obtained from each technique with the exact solution of the system. 


14. 3 decimal digit rounding arithmetic 


15. 3 decimal digit rounding arithmetic 


0.52, + Ldleg + 
2.027, + 4522 + 
9.021 + 0.9620 + 
3.412, + 1.23822 —- 
2.712, + 2.1422 + 
1.892; - 1.9lzg —- 


3.123 
0.3623 
6.523 


1.0923 
1.2923 
1.8923 


fT 


Ut 


6.0 
0.02 
0.96 


4.72 
3.10 
2.91 
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16. 


17. 


18. 


19. 


20. 


4 decimal digit rounding arithmetic 


3 64 4 -1 7 
2 -2 -1 2 1 
5 7 14 -9] 21 
1 3 2 4 ~4 


4 decimal digit rounding arithmetic 


1.9852. — 1.358%. + 2.11323 = —5.56 
0.95382, — 0.652222 -— 1.81573 = 0.1592 
2.6072; + 0.206522 + 3.7973 = —0.357 


Let A be the $n \times n$ matrix whose entries are given by $a_{ij} = 1/(i + j - 1)$ for
$1 \le i, j \le n$.

(a) For n = 5, 6 and 7, solve the system $A\mathbf{x} = \mathbf{b}$ using single precision
arithmetic. In each case, take b as the vector that corresponds to an exact
solution of $x_i = 1$ for each i = 1, 2, 3, ..., n. Compare the solutions
obtained using Gaussian elimination without pivoting, with partial pivoting
and with scaled partial pivoting.


(b) For n = 11, 12, and 13, solve the system $A\mathbf{x} = \mathbf{b}$ using double precision
arithmetic. In each case, take b as the vector that corresponds to an exact
solution of $x_i = 1$ for each i = 1, 2, 3, ..., n. Compare the solutions obtained
using Gaussian elimination without pivoting, with partial pivoting and with
scaled partial pivoting.


Solve the following system in single precision arithmetic.

$$\begin{aligned}
-149x_1 - 50x_2 - 154x_3 &= 353 \\
537x_1 + 180x_2 + 546x_3 &= -1263 \\
-27x_1 - 9x_2 - 25x_3 &= 61
\end{aligned}$$

Use Gaussian elimination without pivoting, with partial pivoting and with scaled
partial pivoting. Which technique provided the most accurate solution? The
exact solution for this problem is $\mathbf{x} = [\,-1 \;\; -1 \;\; -1\,]^T$.


Solve the following system in double precision arithmetic.

$$\left[\begin{array}{rrrrr|r}
-9 & 11 & -21 & 63 & -252 & -356 \\
70 & -69 & 141 & -421 & 1684 & 2385 \\
-575 & 575 & -1149 & 3451 & -13801 & -19551 \\
3891 & -3891 & 7782 & -23345 & 93365 & 132274 \\
1024 & -1024 & 2048 & -6144 & 24572 & 34812
\end{array}\right]$$

Use Gaussian elimination without pivoting, with partial pivoting and with scaled
partial pivoting. Which technique provided the most accurate solution? The
exact solution for this problem is $\mathbf{x} = [\,1 \;\; -1 \;\; 1 \;\; -1 \;\; 1\,]^T$.
3.3. VECTOR AND MATRIX NORMS


Partial pivoting and scaled partial pivoting were introduced as strategies for reduc- 
ing the impact of roundoff error during the Gaussian elimination process. But how 
much error can we expect when solving a system of linear equations, and how does 
the error depend on the properties of the coefficient matrix and the right-hand side 
vector? Further, when A and b are known only approximately, how is the error in 
the solution related to the errors in the data? 

To provide meaningful answers to these questions, tools are needed for mea- 
suring the “size” of a vector and the “size” of a matrix. The development of such 
tools will be the focus of this section. 


Vector Norms 


When working with scalar quantities as in Chapter 2, it was natural to measure
size using the absolute value function. To measure size when working with
n-dimensional vectors, a generalization of absolute value, referred to as a vector norm,
is needed.

Definition. The function $\|\cdot\| : \mathbf{R}^n \to \mathbf{R}$ is called a VECTOR NORM if, for
all $\mathbf{x}, \mathbf{y} \in \mathbf{R}^n$ and all $\alpha \in \mathbf{R}$, the following properties hold:
(i) $\|\mathbf{x}\| \ge 0$;
(ii) $\|\mathbf{x}\| = 0$ if and only if $\mathbf{x} = \mathbf{0}$;
(iii) $\|\alpha\mathbf{x}\| = |\alpha|\,\|\mathbf{x}\|$; and
(iv) $\|\mathbf{x} + \mathbf{y}\| \le \|\mathbf{x}\| + \|\mathbf{y}\|$.


Note that the standard absolute value function satisfies each of these properties for
n = 1. Property (iv) is often called the triangle inequality.

There are infinitely many vector norms that can be constructed. Here, we
will restrict our attention to the two most commonly used in practice. A third will
be considered in the exercises. The two norms on which we will concentrate are the
$l_2$, or Euclidean, norm and the $l_\infty$, or maximum, norm.


Definition. Let $\mathbf{x} \in \mathbf{R}^n$. The $l_2$-norm of x, which is denoted by $\|\mathbf{x}\|_2$, is
defined by

$$\|\mathbf{x}\|_2 = \left(\sum_{i=1}^{n} x_i^2\right)^{1/2}.$$

The $l_\infty$-norm of x, which is denoted by $\|\mathbf{x}\|_\infty$, is defined by

$$\|\mathbf{x}\|_\infty = \max_{1 \le i \le n} |x_i|.$$


We will now show that $\|\cdot\|_2$ satisfies the properties of a vector norm. The
verification that $\|\cdot\|_\infty$ satisfies the required properties is left as an exercise.
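EXAMPLE 3.7 $\|\cdot\|_2$ Is a Vector Norm


To establish that $\|\cdot\|_2$ is a vector norm, we must show that $\|\cdot\|_2$ satisfies each
of the four properties of the definition. In what follows, let x and y be arbitrary
n-vectors, and let $\alpha$ be an arbitrary real number.


(i): $\|\mathbf{x}\|_2 \ge 0$

Since $x_i^2 \ge 0$ for any real number $x_i$, it follows that

$$\|\mathbf{x}\|_2 = \left(\sum_{i=1}^{n} x_i^2\right)^{1/2} \ge 0.$$


(ii): $\|\mathbf{x}\|_2 = 0$ if and only if $\mathbf{x} = \mathbf{0}$

If $\mathbf{x} = \mathbf{0}$, then $x_i = 0$ for each i. Therefore, $\sum x_i^2 = 0$ and $\|\mathbf{x}\|_2 = 0$.
Conversely, if $\|\mathbf{x}\|_2 = 0$, then $\sum x_i^2 = 0$. This can happen only if $x_i = 0$ for each i,
so $\mathbf{x} = \mathbf{0}$.

(iii): $\|\alpha\mathbf{x}\|_2 = |\alpha|\,\|\mathbf{x}\|_2$

$$\|\alpha\mathbf{x}\|_2 = \left(\sum_{i=1}^{n} (\alpha x_i)^2\right)^{1/2} = \left(\alpha^2 \sum_{i=1}^{n} x_i^2\right)^{1/2} = |\alpha| \left(\sum_{i=1}^{n} x_i^2\right)^{1/2} = |\alpha|\,\|\mathbf{x}\|_2.$$


(iv): $\|\mathbf{x} + \mathbf{y}\|_2 \le \|\mathbf{x}\|_2 + \|\mathbf{y}\|_2$

To show that $\|\cdot\|_2$ satisfies this property requires the Cauchy-Buniakowski-Schwarz
inequality, which states that for any $\mathbf{x}, \mathbf{y} \in \mathbf{R}^n$,

$$\left|\sum_{i=1}^{n} x_i y_i\right| \le \|\mathbf{x}\|_2\,\|\mathbf{y}\|_2.$$

We will simply apply this result here and provide a proof below.

$$\begin{aligned}
\|\mathbf{x} + \mathbf{y}\|_2^2 &= \sum_{i=1}^{n} (x_i + y_i)^2 \\
&= \sum_{i=1}^{n} x_i^2 + 2\sum_{i=1}^{n} x_i y_i + \sum_{i=1}^{n} y_i^2 \\
&\le \|\mathbf{x}\|_2^2 + 2\left|\sum_{i=1}^{n} x_i y_i\right| + \|\mathbf{y}\|_2^2 \\
&\le \|\mathbf{x}\|_2^2 + 2\|\mathbf{x}\|_2\|\mathbf{y}\|_2 + \|\mathbf{y}\|_2^2 \\
&= \left(\|\mathbf{x}\|_2 + \|\mathbf{y}\|_2\right)^2
\end{aligned}$$

Upon taking the square root of both sides, the triangle inequality results.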


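For completeness, we now restate and prove the Cauchy-Buniakowski-Schwarz
inequality.

Theorem. Let $\mathbf{x}, \mathbf{y} \in \mathbf{R}^n$. Then

$$\left|\sum_{i=1}^{n} x_i y_i\right| \le \|\mathbf{x}\|_2\,\|\mathbf{y}\|_2.$$

Proof. The inequality is trivially satisfied if $\mathbf{x} = \mathbf{0}$ or $\mathbf{y} = \mathbf{0}$. Therefore,
suppose x and y are both nonzero. Let $\lambda$ be any real number. Then

$$\begin{aligned}
0 \le \|\mathbf{x} + \lambda\mathbf{y}\|_2^2 &= \sum_{i=1}^{n} (x_i + \lambda y_i)^2 \\
&= \sum_{i=1}^{n} x_i^2 + 2\lambda\sum_{i=1}^{n} x_i y_i + \lambda^2\sum_{i=1}^{n} y_i^2 \\
&= \|\mathbf{x}\|_2^2 + 2\lambda\sum_{i=1}^{n} x_i y_i + \lambda^2\|\mathbf{y}\|_2^2.
\end{aligned}$$

Let $a = \|\mathbf{y}\|_2^2$, $b = \sum_{i=1}^{n} x_i y_i$, and $c = \|\mathbf{x}\|_2^2$. The preceding inequality then
becomes

$$a\lambda^2 + 2b\lambda + c \ge 0$$

for all $\lambda \in \mathbf{R}$. This can happen if and only if the discriminant, $(2b)^2 - 4ac$, is
non-positive. Hence $b^2 \le ac$. Substituting the values for a, b, and c gives

$$\left(\sum_{i=1}^{n} x_i y_i\right)^2 \le \|\mathbf{x}\|_2^2\,\|\mathbf{y}\|_2^2,$$

from which the required inequality follows upon taking the square root of both
sides. □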


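EXAMPLE 3.8 Calculating Vector Norms


Consider the three vectors

$$\mathbf{x}_1 = [\,1 \;\; -2 \;\; 3\,]^T, \quad \mathbf{x}_2 = [\,2 \;\; 0 \;\; -1 \;\; 2\,]^T, \quad \text{and} \quad \mathbf{x}_3 = [\,0 \;\; 1 \;\; -4 \;\; 2 \;\; -1\,]^T.$$

The maximum norm of each of these vectors is computed as follows:

$$\begin{aligned}
\|\mathbf{x}_1\|_\infty &= \max\{|1|, |-2|, |3|\} = 3; \\
\|\mathbf{x}_2\|_\infty &= \max\{|2|, |0|, |-1|, |2|\} = 2; \\
\|\mathbf{x}_3\|_\infty &= \max\{|0|, |1|, |-4|, |2|, |-1|\} = 4.
\end{aligned}$$

The Euclidean norm of each vector is

$$\begin{aligned}
\|\mathbf{x}_1\|_2 &= \sqrt{1^2 + (-2)^2 + 3^2} = \sqrt{14} \approx 3.74; \\
\|\mathbf{x}_2\|_2 &= \sqrt{2^2 + 0^2 + (-1)^2 + 2^2} = \sqrt{9} = 3; \\
\|\mathbf{x}_3\|_2 &= \sqrt{0^2 + 1^2 + (-4)^2 + 2^2 + (-1)^2} = \sqrt{22} \approx 4.69.
\end{aligned}$$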
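
These calculations translate directly into code. The sketch below is a plain-Python illustration with hypothetical helper names `norm_2` and `norm_inf`; the first vector is taken to be $[\,1 \;\; -2 \;\; 3\,]^T$, as implied by the Euclidean norm computation in the example.

```python
import math


def norm_2(x):
    """Euclidean norm: square root of the sum of squared components."""
    return math.sqrt(sum(v * v for v in x))


def norm_inf(x):
    """Maximum norm: largest component magnitude."""
    return max(abs(v) for v in x)


vectors = [[1, -2, 3], [2, 0, -1, 2], [0, 1, -4, 2, -1]]
for x in vectors:
    print(x, norm_inf(x), round(norm_2(x), 2))
```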
We will, of course, use vector norms for more than just assigning a size to a
vector. One of the primary uses of vector norms will be to establish the convergence
of a sequence of vectors, say $\{\mathbf{x}^{(k)}\}$. This is done by showing that $\|\mathbf{x}^{(k)} - \mathbf{x}\| \to 0$
as $k \to \infty$ for some limit vector x. Given that different norms can be used to
establish convergence, two important questions naturally arise. First, is it possible
for a sequence to converge in one norm but to diverge in another norm? Second, is
it possible for a sequence to converge to different limit values in different norms?
Fortunately, the answer to both of these questions is no!

The reason that the choice of vector norm is irrelevant when considering
convergence is that, on $\mathbf{R}^n$, all vector norms are equivalent.


Definition. Let $\|\cdot\|$ and $\|\cdot\|'$ be vector norms on $\mathbf{R}^n$. If there exist
positive constants $c_1$ and $c_2$ such that

$$c_1\|\mathbf{x}\| \le \|\mathbf{x}\|' \le c_2\|\mathbf{x}\|$$

for all $\mathbf{x} \in \mathbf{R}^n$, the two norms are said to be EQUIVALENT.


The connection between equivalence and convergence can be explained as
follows. Suppose the sequence $\{\mathbf{x}^{(k)}\}$ converges to x in the $\|\cdot\|$-norm; that is,
$\|\mathbf{x}^{(k)} - \mathbf{x}\| \to 0$ as $k \to \infty$. The right side inequality in the equivalence definition
then guarantees that $\|\mathbf{x}^{(k)} - \mathbf{x}\|' \to 0$ as $k \to \infty$. Hence, the sequence converges to
the same limit value in the $\|\cdot\|'$-norm. Similarly, if $\{\mathbf{x}^{(k)}\}$ converges to x in the
$\|\cdot\|'$-norm, the left side inequality in the definition guarantees convergence of the
sequence to the same limit value in the $\|\cdot\|$-norm.


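Establishing that the $l_2$-norm and the $l_\infty$-norm are equivalent is
straightforward. Let $\mathbf{x} \in R^n$ and suppose that $x_j$ is a component
for which $\|\mathbf{x}\|_\infty = |x_j|$. Then

\[ \|\mathbf{x}\|_\infty^2 = |x_j|^2 = x_j^2 \le \sum_{i=1}^{n} x_i^2 \le n x_j^2 = n \|\mathbf{x}\|_\infty^2. \]

Thus,

\[ \|\mathbf{x}\|_\infty \le \left( \sum_{i=1}^{n} x_i^2 \right)^{1/2} = \|\mathbf{x}\|_2 \le \sqrt{n}\, \|\mathbf{x}\|_\infty, \]

so $c_1 = 1$ and $c_2 = \sqrt{n}$ in the above definition. The proof that any
pair of vector norms on $R^n$ are equivalent can be found in Ortega [1] or
Ortega and Rheinboldt [2].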


Matrix Norms 


To measure the errors introduced during the solution of a linear system, it will
be necessary to have a means for quantifying the "size" of a matrix. This is
done using matrix norms.


Definition. A MATRIX NORM is a function $\|\cdot\| : R^{n \times n} \to R$ that,
for all $A, B \in R^{n \times n}$ and all $\alpha \in R$, satisfies
(i) $\|A\| \ge 0$;
(ii) $\|A\| = 0$ if and only if $A = 0$;
(iii) $\|\alpha A\| = |\alpha| \, \|A\|$;
(iv) $\|A + B\| \le \|A\| + \|B\|$; and
(v) $\|AB\| \le \|A\| \, \|B\|$.


Although the same symbol is used to denote both vector and matrix norms, 
the type of norm being used should be clear from the specific context. 

As with vector norms, there are various ways to obtain matrix norms. We
will, however, restrict attention to those matrix norms which are related to vector 
norms. These are referred to as the natural matrix norms. 


Definition. Let $\|\cdot\|_v$ be a vector norm. The real-valued function
$\|\cdot\|$ that is defined for all $A \in R^{n \times n}$ by

\[ \|A\| = \max_{\mathbf{x} \ne \mathbf{0}} \frac{\|A\mathbf{x}\|_v}{\|\mathbf{x}\|_v} \]

is called the NATURAL, or OPERATOR, NORM associated with (generated by,
induced by) the vector norm $\|\cdot\|_v$.


All natural matrix norms possess an important consistency property. Since
$\|A\|$ is defined as the maximum of the ratio
$\|A\mathbf{x}\|_v / \|\mathbf{x}\|_v$, it follows that for any nonzero
$n$-vector $\mathbf{x}$

\[ \|A\| \ge \frac{\|A\mathbf{x}\|_v}{\|\mathbf{x}\|_v}, \]

or, equivalently, $\|A\mathbf{x}\|_v \le \|A\| \, \|\mathbf{x}\|_v$. This
inequality is often used to provide a bound on the value of $\|A\mathbf{x}\|_v$
and plays a central role in proving that a natural norm possesses the
properties in the definition of a matrix norm.


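Theorem. Let $\|\cdot\|_v$ be a vector norm. The natural norm associated with
$\|\cdot\|_v$ is a matrix norm.


Proof. Properties (i) and (iii) of a matrix norm follow directly from the
corresponding properties of a vector norm. To establish property (ii), note
that

\[ \|A\| = 0 \;\Leftrightarrow\; \|A\mathbf{x}\|_v = 0 \text{ for all } \mathbf{x} \ne \mathbf{0}
   \;\Leftrightarrow\; A\mathbf{x} = \mathbf{0} \text{ for all } \mathbf{x} \ne \mathbf{0}
   \;\Leftrightarrow\; A = 0. \]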


For property (iv), let $\mathbf{x}$ be any nonzero $n$-vector. Then

\[ \|(A + B)\mathbf{x}\|_v = \|A\mathbf{x} + B\mathbf{x}\|_v \le \|A\mathbf{x}\|_v + \|B\mathbf{x}\|_v \]

by the triangle inequality for the vector norm. By the consistency property
of the natural norm, $\|A\mathbf{x}\|_v \le \|A\| \, \|\mathbf{x}\|_v$ and
$\|B\mathbf{x}\|_v \le \|B\| \, \|\mathbf{x}\|_v$. Thus

\[ \|(A + B)\mathbf{x}\|_v \le \|A\| \, \|\mathbf{x}\|_v + \|B\| \, \|\mathbf{x}\|_v, \]

or

\[ \frac{\|(A + B)\mathbf{x}\|_v}{\|\mathbf{x}\|_v} \le \|A\| + \|B\|
   \quad\Rightarrow\quad \|A + B\| \le \|A\| + \|B\|. \]

Property (v) can be established in a similar manner and is left as an exercise. □


The maximum matrix norm, $\|A\|_\infty$, is fairly easy to calculate in terms of
the entries in $A$. For, suppose that $\mathbf{x}$ is any nonzero $n$-vector.
Recall that the $i$th component of the product $A\mathbf{x}$ is given by

\[ (A\mathbf{x})_i = \sum_{j=1}^{n} a_{ij} x_j. \]

Therefore,

\[ \|A\mathbf{x}\|_\infty = \max_i \left| \sum_{j=1}^{n} a_{ij} x_j \right|
   \le \max_i \sum_{j=1}^{n} |a_{ij}| \, |x_j|
   \le \max_j |x_j| \, \max_i \sum_{j=1}^{n} |a_{ij}|
   = \|\mathbf{x}\|_\infty \max_i \sum_{j=1}^{n} |a_{ij}|, \]

from which it follows that

\[ \|A\|_\infty \le \max_i \sum_{j=1}^{n} |a_{ij}|. \qquad (1) \]

Now, let $k$ be an index for which

\[ \sum_{j=1}^{n} |a_{kj}| = \max_i \sum_{j=1}^{n} |a_{ij}|, \]

and define the vector $\mathbf{x}$ by

\[ x_j = \begin{cases} 1, & a_{kj} \ge 0 \\ -1, & a_{kj} < 0. \end{cases} \]

Then $a_{kj} x_j = |a_{kj}|$ for each $j$. For this $\mathbf{x}$,

\[ \|A\mathbf{x}\|_\infty = \max_i \left| \sum_{j=1}^{n} a_{ij} x_j \right|
   \ge \left| \sum_{j=1}^{n} a_{kj} x_j \right|
   = \sum_{j=1}^{n} |a_{kj}|
   = \|\mathbf{x}\|_\infty \max_i \sum_{j=1}^{n} |a_{ij}|, \]

where we have used the fact that $\|\mathbf{x}\|_\infty = 1$. Therefore,

\[ \|A\|_\infty \ge \max_i \sum_{j=1}^{n} |a_{ij}|. \qquad (2) \]

Combining equations (1) and (2) yields

\[ \|A\|_\infty = \max_i \sum_{j=1}^{n} |a_{ij}|. \]

Since $\|A\|_\infty$ is based on sums of absolute values of the entries along
each row of $A$, the $l_\infty$ natural matrix norm is also referred to as the
row norm of $A$.
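The row-sum formula, together with the sign vector used in the second half of the derivation, can be sketched in Python. This is an illustrative sketch only; the function names and the 2 x 2 test matrix are assumptions made for the example, not taken from the derivation above.

```python
def row_norm(A):
    """l-infinity natural matrix norm: largest absolute row sum."""
    return max(sum(abs(a) for a in row) for row in A)

def maximizing_vector(A):
    """Sign vector (entries +1 or -1) built from the row with the largest absolute sum."""
    k = max(range(len(A)), key=lambda i: sum(abs(a) for a in A[i]))
    return [1.0 if a >= 0 else -1.0 for a in A[k]]

# Illustrative matrix (an assumption for this sketch)
A = [[1.0, -2.0], [4.0, 3.0]]
x = maximizing_vector(A)
Ax = [sum(a * c for a, c in zip(row, x)) for row in A]

# The max norm of A x equals the row norm, since ||x||_inf = 1 for the sign vector.
print(row_norm(A), max(abs(c) for c in Ax))
```

The sign vector shows that the upper bound (1) is actually attained, which is the content of inequality (2).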

The $l_2$ natural matrix norm, unfortunately, is not as straightforward to
calculate and requires knowledge of the eigenvalues of the matrix.


Definition. Let $A \in R^{n \times n}$. If for some number $\lambda$ (which may
be complex) there exists a nonzero vector $\mathbf{x}$ such that
$A\mathbf{x} = \lambda\mathbf{x}$, then $\lambda$ is an EIGENVALUE of $A$ and
$\mathbf{x}$ is an EIGENVECTOR corresponding to $\lambda$.


The eigenvalue relation $A\mathbf{x} = \lambda\mathbf{x}$ is equivalent to the
linear system $(A - \lambda I)\mathbf{x} = \mathbf{0}$. For this system to have
a nonzero solution for $\mathbf{x}$, the matrix $A - \lambda I$ must be
singular. Thus, the eigenvalues of $A$ are those values of $\lambda$ for which
$\det(A - \lambda I) = 0$. As a function of $\lambda$, $\det(A - \lambda I)$ is
an $n$th-degree polynomial, known as the characteristic polynomial of $A$. So,
counting multiplicities, an $n \times n$ matrix has precisely $n$ eigenvalues.
The set of all eigenvalues for a given matrix $A$ is called the spectrum of $A$
and is denoted by $\sigma(A)$.


EXAMPLE 3.9 Calculating Eigenvalues

Consider the matrix

\[ A = \begin{bmatrix} 18 & 10 \\ 10 & 13 \end{bmatrix}. \]

The characteristic polynomial associated with this matrix is

\[ p(\lambda) = \det(A - \lambda I)
   = \det \begin{bmatrix} 18 - \lambda & 10 \\ 10 & 13 - \lambda \end{bmatrix}
   = (18 - \lambda)(13 - \lambda) - 100 = \lambda^2 - 31\lambda + 134. \]

The eigenvalues of $A$ are the roots of this polynomial:

\[ \lambda = \frac{31 \pm \sqrt{31^2 - 4(134)}}{2} = \frac{31 \pm \sqrt{425}}{2} = \frac{31 \pm 5\sqrt{17}}{2}. \]

As a second example, consider the $3 \times 3$ matrix

\[ A = \begin{bmatrix} 2 & -1 & 1 \\ -1 & 2 & 0 \\ 1 & 0 & 6 \end{bmatrix}. \]

The characteristic polynomial associated with this matrix is

\[ p(\lambda) = \det(A - \lambda I)
   = (2 - \lambda)(2 - \lambda)(6 - \lambda) - (2 - \lambda) - (6 - \lambda)
   = -\lambda^3 + 10\lambda^2 - 26\lambda + 16. \]

To five decimal places, the roots of this polynomial, and hence the eigenvalues
of $A$, are

\[ \lambda_1 = 0.89722, \quad \lambda_2 = 2.85363 \quad \text{and} \quad \lambda_3 = 6.24914. \]
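The eigenvalues quoted in Example 3.9 can be sanity-checked numerically. The sketch below is an illustration added here, with hypothetical helper names: it applies the quadratic formula to the 2 x 2 symmetric matrix and evaluates the cubic characteristic polynomial at the three reported roots, where the values should be close to zero.

```python
import math

def eig_2x2_symmetric(a, b, d):
    """Eigenvalues of [[a, b], [b, d]] via the quadratic formula."""
    trace, det = a + d, a * d - b * b
    disc = math.sqrt(trace * trace - 4 * det)
    return (trace - disc) / 2, (trace + disc) / 2

def char_poly_3x3(lam):
    """p(lambda) = -lambda^3 + 10 lambda^2 - 26 lambda + 16 from the example."""
    return -lam**3 + 10 * lam**2 - 26 * lam + 16

small, large = eig_2x2_symmetric(18, 10, 13)
print(small, large)

for lam in (0.89722, 2.85363, 6.24914):
    # Each printed value is p(lambda) at a rounded root, so it is small but not exactly zero.
    print(lam, char_poly_3x3(lam))
```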


One of the most important quantities related to the eigenvalues of a matrix
is the spectral radius.


Definition. The SPECTRAL RADIUS $\rho(A)$ of the matrix $A$ is defined by

\[ \rho(A) = \max_{\lambda \in \sigma(A)} |\lambda|. \]

The relationship between the spectral radius and the norm of the matrix is
provided by the following theorem.


Theorem. Let $A$ be an $n \times n$ matrix. Then
(i) $\|A\|_2 = \sqrt{\rho(A^T A)}$;
(ii) $\rho(A) \le \|A\|$ for any natural norm; and
(iii) for any $\epsilon > 0$, there exists a natural norm $\|\cdot\|$ for which
$\|A\| \le \rho(A) + \epsilon$.


Proof. (i) and (iii): See Isaacson and Keller [3] or Ortega [1].
(ii) Let $\lambda \in \sigma(A)$ with associated eigenvector $\mathbf{x}$.
Taking the norm of both sides of the eigenvalue relation
$\lambda\mathbf{x} = A\mathbf{x}$ then yields

\[ |\lambda| \, \|\mathbf{x}\| = \|\lambda\mathbf{x}\| = \|A\mathbf{x}\| \le \|A\| \, \|\mathbf{x}\|, \]

or $|\lambda| \le \|A\|$. Therefore,

\[ \rho(A) = \max_{\lambda \in \sigma(A)} |\lambda| \le \|A\|. \qquad \Box \]

Note that conclusions (ii) and (iii) of this theorem indicate that the spectral
radius is the greatest lower bound for the natural norms of a matrix. Since
$\|A\|_2$ is based on the spectral radius, the $l_2$ natural matrix norm is also
referred to as the spectral norm.
EXAMPLE 3.10 Calculating the $l_2$ and $l_\infty$ Norms of a Matrix

Let's calculate both the $l_2$ and the $l_\infty$ norms of the matrices

\[ A_1 = \begin{bmatrix} 1 & -2 \\ 4 & 3 \end{bmatrix} \quad \text{and} \quad
   A_2 = \begin{bmatrix} 1 & 0 & 2 \\ 0 & 1 & -1 \\ -1 & 1 & 1 \end{bmatrix}. \]

Starting with the matrix $A_1$,

\[ \|A_1\|_\infty = \max\{|1| + |-2|, |4| + |3|\} = \max\{3, 7\} = 7. \]

To determine the $l_2$ norm, we first compute

\[ A_1^T A_1 = \begin{bmatrix} 1 & 4 \\ -2 & 3 \end{bmatrix}
               \begin{bmatrix} 1 & -2 \\ 4 & 3 \end{bmatrix}
             = \begin{bmatrix} 17 & 10 \\ 10 & 13 \end{bmatrix}. \]

The characteristic polynomial of this matrix is
$\lambda^2 - 30\lambda + 121$, so its eigenvalues are $15 \pm 2\sqrt{26}$. Hence,

\[ \rho(A_1^T A_1) = 15 + 2\sqrt{26} \quad \text{and} \quad
   \|A_1\|_2 = \sqrt{15 + 2\sqrt{26}} \approx 5.01976. \]

For the matrix $A_2$,

\[ \|A_2\|_\infty = \max\{|1| + |0| + |2|, |0| + |1| + |-1|, |-1| + |1| + |1|\}
   = \max\{3, 2, 3\} = 3, \]

and

\[ A_2^T A_2 = \begin{bmatrix} 2 & -1 & 1 \\ -1 & 2 & 0 \\ 1 & 0 & 6 \end{bmatrix}. \]

The eigenvalues of $A_2^T A_2$ were previously found to be 0.89722, 2.85363, and
6.24914. Hence,

\[ \rho(A_2^T A_2) = 6.24914 \quad \text{and} \quad \|A_2\|_2 = \sqrt{6.24914} = 2.49983. \]
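When the eigenvalues of $A^T A$ are not available in closed form, the spectral norm can be estimated iteratively. The following Python sketch uses power iteration on $A^T A$, a technique not developed in this section and swapped in here purely for illustration; the function names, the starting vector, and the fixed iteration count are all assumptions. It is applied to the 3 x 3 matrix $A_2$ of Example 3.10.

```python
import math

def matvec(M, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(m * c for m, c in zip(row, v)) for row in M]

def spectral_norm(A, iterations=200):
    """Estimate ||A||_2 = sqrt(rho(A^T A)) by power iteration on A^T A.

    Assumes A is nonzero and the all-ones start vector is not orthogonal
    to the dominant eigenvector of A^T A.
    """
    n = len(A[0])
    At = [list(col) for col in zip(*A)]
    v = [1.0 / math.sqrt(n)] * n                  # unit-length starting vector
    lam = 0.0
    for _ in range(iterations):
        w = matvec(At, matvec(A, v))              # w = A^T A v
        lam = math.sqrt(sum(c * c for c in w))    # ||w||_2 approximates rho(A^T A)
        v = [c / lam for c in w]                  # renormalize for the next pass
    return math.sqrt(lam)

A2 = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0], [-1.0, 1.0, 1.0]]
print(round(spectral_norm(A2), 5))
```

For this matrix the estimate agrees with the value 2.49983 obtained from the eigenvalues in the example.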
EXERCISES
1. Verify that the $l_\infty$-norm,
\[ \|\mathbf{x}\|_\infty = \max_i |x_i|, \]
satisfies the properties of a vector norm.
2. Compute the $l_2$-norm and the $l_\infty$-norm for each of the following vectors.
(a) $\mathbf{x} = [\,3 \;\; -5 \;\; \sqrt{2}\,]^T$
(b) $\mathbf{x} = [\,2 \;\; 1 \;\; -3 \;\; 4\,]^T$
(c) $\mathbf{x} = [\,4 \;\; -8 \;\; 1\,]^T$
(d) x=[-2V73 -6 4 2]7 
(e) $\mathbf{x} = [\,e \;\; \pi \;\; -1\,]^T$
3. (a) Show that the function $\|\cdot\|_1 : R^n \to R$ defined by
\[ \|\mathbf{x}\|_1 = \sum_{i=1}^{n} |x_i| \]
is a vector norm. The operator $\|\cdot\|_1$ is known as the $l_1$-norm.
(b) Compute the $l_1$-norm for each of the vectors in Exercise 2.
(c) Show that $\|\mathbf{x}\|_\infty \le \|\mathbf{x}\|_1 \le n\|\mathbf{x}\|_\infty$ for all $\mathbf{x} \in R^n$.
(d) Show that $\|\mathbf{x}\|_2 \le \|\mathbf{x}\|_1 \le \sqrt{n}\,\|\mathbf{x}\|_2$ for all $\mathbf{x} \in R^n$.


4. Let $\|\cdot\|_v$ be a vector norm. Show that the natural norm associated with
$\|\cdot\|_v$ satisfies $\|AB\| \le \|A\| \, \|B\|$ for all $A, B \in R^{n \times n}$.


5. Compute the spectrum of each of the following matrices. 
4 -2 
eae | ae 
eae 
(his 03 08 


2 <3 1 
© a=[3 ~2 | 


ober 
i ot 
@a-[03 1 
O55 Ae 


6. Compute the /2-norm and the l.o-norm for each of the following matrices. 


(@) A=] 2 A 


(b) A= 


v 1 
wn 
| 


4 

1 

f4 -1 -2 
{(c) A=] 1 2 -3| 

LO 
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2 1 QO 
@ a=] 2 -1 
-3 4 ~4 


7. (a) Prove that the natural matrix norm associated with the $l_1$ vector norm
(see Exercise 3) is given by
\[ \|A\|_1 = \max_j \sum_{i=1}^{n} |a_{ij}| \]
for all $A \in R^{n \times n}$. This is also known as the column norm of $A$.
(b) Compute $\|\cdot\|_1$ for each of the matrices in Exercise 6.


8. The Frobenius norm (which is not a natural matrix norm) is defined by
\[ \|A\|_F = \left( \sum_{i=1}^{n} \sum_{j=1}^{n} |a_{ij}|^2 \right)^{1/2} \]
for all $A \in R^{n \times n}$.
(a) Show that $\|\cdot\|_F$ is a matrix norm.
(b) Compute the Frobenius norm for each of the matrices in Exercise 6.
9. (a) Let $\lambda$ be an eigenvalue of the matrix $A$ with associated
eigenvector $\mathbf{x}$. For any integer $k \ge 1$, show that $\lambda^k$ is an
eigenvalue of $A^k$ with eigenvector $\mathbf{x}$.
(b) Let $A$ be a symmetric matrix. Show that $\|A\|_2 = \rho(A)$.


10. Show that if $A$ is a matrix with $\rho(A) < 1$, then the matrix $I - A$ is
nonsingular. (Hint: Assume that $I - A$ is singular and show this leads to the
conclusion that $\lambda = 1$ is an eigenvalue of $A$.)


11. (a) Let $D$ be an $n \times n$ diagonal matrix. Show that the eigenvalues of
$D$ are the diagonal elements $d_{11}, d_{22}, d_{33}, \ldots, d_{nn}$.
(b) Let $U$ be an $n \times n$ upper triangular matrix. Show that the
eigenvalues of $U$ are the diagonal elements $u_{11}, u_{22}, u_{33}, \ldots, u_{nn}$.

3.4 ERROR ESTIMATES AND CONDITION NUMBER 


Having developed the appropriate tools (i.e., vector and matrix norms), we now
address the questions raised at the beginning of the last section. In particular, how
much error can we expect when solving a system of linear equations using Gaussian
elimination, and how does the error depend on the properties of the coefficient
matrix and the right-hand side vector? Further, when $A$ and $\mathbf{b}$ are known only
approximately, how is the error in the solution related to the errors in the data?


Error Estimates 


Suppose $\tilde{\mathbf{x}}$ is an approximate solution to the linear system
$A\mathbf{x} = \mathbf{b}$, whose exact solution is the vector $\mathbf{x}$. In
practice, the exact solution to the system is unknown, so the error in
$\tilde{\mathbf{x}}$, $\mathbf{e} = \tilde{\mathbf{x}} - \mathbf{x}$, cannot be
directly computed. However, the residual vector, which is defined as
$\mathbf{r} = A\tilde{\mathbf{x}} - \mathbf{b}$, can be easily computed. Note
that the residual measures the amount by which the approximate solution fails
to satisfy the linear system. When $\mathbf{r} = \mathbf{0}$, it follows that
$\tilde{\mathbf{x}}$ is the exact solution, so $\mathbf{e} = \mathbf{0}$. It
seems reasonable to expect, therefore, that whenever $\|\mathbf{r}\|$ is small,
$\|\mathbf{e}\|$ will be small as well. Unfortunately, this need not always be
the case.


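EXAMPLE 3.11 A Small Residual But A Large Error

The linear system

\[ \begin{bmatrix} 1 & -2 \\ -0.99 & 1.99 \end{bmatrix} \mathbf{x} = \begin{bmatrix} -1 \\ 1 \end{bmatrix} \]

has $\mathbf{x} = [\,1 \;\; 1\,]^T$ as its exact solution. The vector
$\tilde{\mathbf{x}} = [\,-1 \;\; 0\,]^T$ is an obviously poor approximation to
$\mathbf{x}$:

\[ \mathbf{e} = [\,-2 \;\; -1\,]^T \quad \Rightarrow \quad \|\mathbf{e}\|_\infty = 2. \]

However, the residual associated with $\tilde{\mathbf{x}}$ is

\[ \mathbf{r} = \begin{bmatrix} 1 & -2 \\ -0.99 & 1.99 \end{bmatrix}
                \begin{bmatrix} -1 \\ 0 \end{bmatrix} - \begin{bmatrix} -1 \\ 1 \end{bmatrix}
              = \begin{bmatrix} -1 \\ 0.99 \end{bmatrix} - \begin{bmatrix} -1 \\ 1 \end{bmatrix}
              = \begin{bmatrix} 0 \\ -0.01 \end{bmatrix}, \]

so $\|\mathbf{r}\|_\infty = 0.01$. Thus, the error is 200 times larger than the
residual.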
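The numbers in Example 3.11 are easy to reproduce. The Python sketch below is an added illustration with made-up helper names; it forms the error and the residual for the approximate solution and prints their maximum norms.

```python
def matvec(A, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * c for a, c in zip(row, v)) for row in A]

def max_norm(v):
    """Maximum (l-infinity) norm of a vector."""
    return max(abs(c) for c in v)

A = [[1.0, -2.0], [-0.99, 1.99]]
b = [-1.0, 1.0]
x_exact = [1.0, 1.0]
x_approx = [-1.0, 0.0]

error = [xa - xe for xa, xe in zip(x_approx, x_exact)]          # e = x~ - x
residual = [ax - bi for ax, bi in zip(matvec(A, x_approx), b)]  # r = A x~ - b

print(max_norm(error), max_norm(residual))
```

The ratio of the two printed values is roughly 200, matching the conclusion of the example.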


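The next theorem shows that the norms of both the coefficient matrix and its
inverse play an important role in the reliability of the residual as a
predictor of error.


Theorem. Let $A$ be a nonsingular matrix, $\tilde{\mathbf{x}}$ be an
approximate solution to the linear system $A\mathbf{x} = \mathbf{b}$,
$\mathbf{r} = A\tilde{\mathbf{x}} - \mathbf{b}$ and
$\mathbf{e} = \tilde{\mathbf{x}} - \mathbf{x}$. Then, for any natural matrix
norm $\|\cdot\|$,

\[ \frac{1}{\|A\|} \|\mathbf{r}\| \le \|\mathbf{e}\| \le \|A^{-1}\| \, \|\mathbf{r}\| \]

and

\[ \frac{1}{\|A\| \, \|A^{-1}\|} \frac{\|\mathbf{r}\|}{\|\mathbf{b}\|}
   \le \frac{\|\mathbf{e}\|}{\|\mathbf{x}\|}
   \le \|A\| \, \|A^{-1}\| \frac{\|\mathbf{r}\|}{\|\mathbf{b}\|}, \qquad (1) \]

provided $\mathbf{x} \ne \mathbf{0}$ and $\mathbf{b} \ne \mathbf{0}$.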


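Proof. First we need a relationship between $\mathbf{e}$ and $\mathbf{r}$.
Combining $\mathbf{r} = A\tilde{\mathbf{x}} - \mathbf{b}$,
$\mathbf{b} = A\mathbf{x}$ and $\mathbf{e} = \tilde{\mathbf{x}} - \mathbf{x}$,
it follows that

\[ \mathbf{r} = A\tilde{\mathbf{x}} - \mathbf{b} = A\tilde{\mathbf{x}} - A\mathbf{x} = A(\tilde{\mathbf{x}} - \mathbf{x}) = A\mathbf{e}. \]

Equivalently, $\mathbf{e} = A^{-1}\mathbf{r}$. Now, let $\|\cdot\|$ be any
natural matrix norm. An immediate consequence of $\mathbf{e} = A^{-1}\mathbf{r}$ is

\[ \|\mathbf{e}\| = \|A^{-1}\mathbf{r}\| \le \|A^{-1}\| \, \|\mathbf{r}\|. \]

From $\mathbf{r} = A\mathbf{e}$, we obtain

\[ \|\mathbf{r}\| = \|A\mathbf{e}\| \le \|A\| \, \|\mathbf{e}\|, \]

or $\|\mathbf{e}\| \ge \|\mathbf{r}\| / \|A\|$. Thus,

\[ \frac{1}{\|A\|} \|\mathbf{r}\| \le \|\mathbf{e}\| \le \|A^{-1}\| \, \|\mathbf{r}\|. \qquad (2) \]

Next, suppose that $\mathbf{x} \ne \mathbf{0}$ and $\mathbf{b} \ne \mathbf{0}$.
Taking the norm of both sides of $A\mathbf{x} = \mathbf{b}$ yields

\[ \|\mathbf{b}\| = \|A\mathbf{x}\| \le \|A\| \, \|\mathbf{x}\|
   \quad \Rightarrow \quad \frac{1}{\|\mathbf{x}\|} \le \frac{\|A\|}{\|\mathbf{b}\|}. \]

Similarly, from $\mathbf{x} = A^{-1}\mathbf{b}$, we obtain

\[ \frac{1}{\|\mathbf{x}\|} \ge \frac{1}{\|A^{-1}\| \, \|\mathbf{b}\|}. \]

Hence,

\[ \frac{1}{\|A^{-1}\| \, \|\mathbf{b}\|} \le \frac{1}{\|\mathbf{x}\|} \le \frac{\|A\|}{\|\mathbf{b}\|}. \qquad (3) \]

Finally, combining (2) and (3) yields

\[ \frac{1}{\|A\| \, \|A^{-1}\|} \frac{\|\mathbf{r}\|}{\|\mathbf{b}\|}
   \le \frac{\|\mathbf{e}\|}{\|\mathbf{x}\|}
   \le \|A\| \, \|A^{-1}\| \frac{\|\mathbf{r}\|}{\|\mathbf{b}\|}. \qquad \Box \]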

The inequalities in (1) provide lower and upper bounds on the relative error in
an approximate solution to $A\mathbf{x} = \mathbf{b}$ in terms of the relative
residual, $\|\mathbf{r}\| / \|\mathbf{b}\|$, and the norms of $A$ and its
inverse. The quantity $\kappa(A) = \|A\| \, \|A^{-1}\|$ occurs frequently in the
analysis of linear systems and is known as the condition number of $A$. The
value of $\kappa(A)$ depends heavily upon the matrix norm being used; however,
for any nonsingular matrix and any natural matrix norm, the following bound
applies:

\[ 1 = \|I\| = \|A \cdot A^{-1}\| \le \|A\| \, \|A^{-1}\| = \kappa(A). \]

When $\kappa(A)$ is small (i.e., $\approx 1$), the relative residual provides a
good measure for the error in the approximate solution. On the other hand, when
$\kappa(A)$ is large, the relative residual can be a very poor indicator of the
accuracy of the approximate solution.
EXAMPLE 3.12 A Small Residual But A Large Error, Continued

The inverse of the coefficient matrix from the previous example,

\[ A = \begin{bmatrix} 1 & -2 \\ -0.99 & 1.99 \end{bmatrix}, \]

is

\[ A^{-1} = \begin{bmatrix} 199 & 200 \\ 99 & 100 \end{bmatrix}. \]

Hence, $\|A\|_\infty = 3$, $\|A^{-1}\|_\infty = 399$, and
$\kappa_\infty(A) = 3(399) = 1197$. The relative error in an approximate
solution to a system with $A$ as its coefficient matrix can therefore
be as small as 1/1197 times, or as large as 1197 times, the relative residual.
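For a 2 x 2 matrix the condition number in the $l_\infty$ norm can be computed directly from the explicit inverse. The sketch below is illustrative only; `inverse_2x2`, `row_norm`, and `cond_inf` are hypothetical helper names, and forming an explicit inverse is reasonable here only because the matrix is tiny.

```python
def inverse_2x2(A):
    """Inverse of a nonsingular 2 x 2 matrix via the adjugate formula."""
    (a, b), (c, d) = A
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def row_norm(A):
    """l-infinity natural matrix norm: largest absolute row sum."""
    return max(sum(abs(e) for e in row) for row in A)

def cond_inf(A):
    """Condition number ||A||_inf * ||A^{-1}||_inf."""
    return row_norm(A) * row_norm(inverse_2x2(A))

A = [[1.0, -2.0], [-0.99, 1.99]]
print(row_norm(A), row_norm(inverse_2x2(A)), cond_inf(A))
```

Up to floating point rounding, the three printed values reproduce 3, 399, and 1197 from the example.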


Perturbations to A and b 


What if the entries in $A$ and $\mathbf{b}$ are known only approximately, due
perhaps to data errors or roundoff errors or both? How does the error in the
computed solution depend on the errors in $A$ and $\mathbf{b}$? Let $\delta A$
and $\delta\mathbf{b}$ denote the perturbations to $A$ and $\mathbf{b}$,
respectively, and let $\mathbf{x} + \delta\mathbf{x}$ denote the solution to
the system with perturbed coefficient matrix and right-side vector. That is,

\[ (A + \delta A)(\mathbf{x} + \delta\mathbf{x}) = \mathbf{b} + \delta\mathbf{b}. \qquad (4) \]

Further, suppose that $\|\delta A\| < 1 / \|A^{-1}\|$, which guarantees that
$A + \delta A$ remains nonsingular (see Exercise 3).

Now, expand the product on the left side of (4), cancel the term $A\mathbf{x}$
on the left with the term $\mathbf{b}$ on the right (since
$A\mathbf{x} = \mathbf{b}$) and rearrange the remaining terms to produce

\[ \delta\mathbf{x} = A^{-1} \left[ \delta\mathbf{b} - (\delta A)\mathbf{x} - (\delta A)(\delta\mathbf{x}) \right]. \]


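Next, take norms, repeatedly apply the consistency property of the natural
matrix norm and the triangle inequality and solve for $\|\delta\mathbf{x}\|$.
The result is

\[ \|\delta\mathbf{x}\| \le \frac{\|A^{-1}\|}{1 - \|A^{-1}\| \, \|\delta A\|}
      \left( \|\delta\mathbf{b}\| + \|\delta A\| \, \|\mathbf{x}\| \right) \]
\[ = \frac{\|A\| \, \|A^{-1}\|}{1 - \|A\| \, \|A^{-1}\| \, (\|\delta A\| / \|A\|)}
      \left( \frac{\|\delta\mathbf{b}\|}{\|A\|} + \frac{\|\delta A\|}{\|A\|} \|\mathbf{x}\| \right) \]
\[ = \frac{\kappa(A)}{1 - \kappa(A) (\|\delta A\| / \|A\|)}
      \left( \frac{\|\delta\mathbf{b}\|}{\|A\|} + \frac{\|\delta A\|}{\|A\|} \|\mathbf{x}\| \right). \]

Finally, divide through by $\|\mathbf{x}\|$ and use the relation
$\|A\| \, \|\mathbf{x}\| \ge \|\mathbf{b}\|$ to obtain

\[ \frac{\|\delta\mathbf{x}\|}{\|\mathbf{x}\|} \le
   \frac{\kappa(A)}{1 - \kappa(A) (\|\delta A\| / \|A\|)}
   \left( \frac{\|\delta\mathbf{b}\|}{\|\mathbf{b}\|} + \frac{\|\delta A\|}{\|A\|} \right). \qquad (5) \]

Note the presence of the condition number in (5). Even if the relative errors
in $A$ and $\mathbf{b}$ are small, the relative error in the approximate
solution may be significant if $\kappa(A)$ is large.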


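EXAMPLE 3.13 A Perturbed System

Let

\[ A = \begin{bmatrix} 1 & -2 \\ -0.99 & 1.99 \end{bmatrix} \quad \text{and} \quad
   \mathbf{b} = \begin{bmatrix} -1 \\ 1 \end{bmatrix}. \]

Recall that the true solution to $A\mathbf{x} = \mathbf{b}$ is
$\mathbf{x} = [\,1 \;\; 1\,]^T$ and that $\kappa_\infty(A) = 1197$. As an
experiment, let's first change the right side vector to
$\mathbf{b} + \delta\mathbf{b}$, where
$\delta\mathbf{b} = [\,0.01 \;\; 0.01\,]^T$. The true solution of
$A\hat{\mathbf{x}} = \mathbf{b} + \delta\mathbf{b}$ is
$\hat{\mathbf{x}} = [\,4.99 \;\; 2.99\,]^T$. Thus
$\delta\mathbf{x} = [\,3.99 \;\; 1.99\,]^T$, and

\[ \frac{\|\delta\mathbf{x}\|_\infty}{\|\mathbf{x}\|_\infty} = \frac{3.99}{1} = 3.99. \]

Though this constitutes a substantial relative change in the solution, it is
quite a bit less than the maximum possible change, which for
$\kappa_\infty(A) = 1197$, $\|\delta\mathbf{b}\|_\infty = 0.01$ and
$\|\mathbf{b}\|_\infty = 1$ is

\[ \frac{1197}{1 - 1197 \cdot 0} \left( \frac{0.01}{1} + 0 \right) = 11.97. \]

Next, suppose we perturb both $A$ and $\mathbf{b}$ with

\[ \delta A = \begin{bmatrix} -0.001 & -0.001 \\ -0.001 & -0.001 \end{bmatrix} \quad \text{and} \quad
   \delta\mathbf{b} = \begin{bmatrix} 0.01 \\ 0.01 \end{bmatrix}. \]

The true solution of the system
$(A + \delta A)\hat{\mathbf{x}} = \mathbf{b} + \delta\mathbf{b}$ is
$\hat{\mathbf{x}} = [\,12.910 \;\; 6.940\,]^T$. Hence,

\[ \frac{\|\delta\mathbf{x}\|_\infty}{\|\mathbf{x}\|_\infty} = \frac{11.910}{1} \approx 11.91. \]

Once again, though quite large, this relative change in the solution is
significantly less than the maximum possible change, which for
$\kappa_\infty(A) = 1197$, $\|\delta A\|_\infty = 0.002$, $\|A\|_\infty = 3$,
$\|\delta\mathbf{b}\|_\infty = 0.01$, and $\|\mathbf{b}\|_\infty = 1$ is

\[ \frac{1197}{1 - 1197 \left( \frac{0.002}{3} \right)} \left( \frac{0.01}{1} + \frac{0.002}{3} \right) \approx 63.21. \]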
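The comparison carried out in Example 3.13 can be scripted. The following Python sketch is an added illustration: the helper names are hypothetical, the 2 x 2 solve uses Cramer's rule purely for brevity, and `perturbation_bound` evaluates the right side of (5) under the assumption that $\kappa(A) \|\delta A\| / \|A\| < 1$.

```python
def solve_2x2(A, b):
    """Solve a 2 x 2 system by Cramer's rule (adequate for this small illustration)."""
    (a11, a12), (a21, a22) = A
    det = a11 * a22 - a12 * a21
    return [(b[0] * a22 - a12 * b[1]) / det, (a11 * b[1] - a21 * b[0]) / det]

def row_norm(A):
    return max(sum(abs(e) for e in row) for row in A)

def max_norm(v):
    return max(abs(c) for c in v)

def perturbation_bound(kappa, rel_dA, rel_db):
    """Right side of (5); only meaningful when kappa * rel_dA < 1."""
    return kappa / (1.0 - kappa * rel_dA) * (rel_db + rel_dA)

A = [[1.0, -2.0], [-0.99, 1.99]]
b = [-1.0, 1.0]
dA = [[-0.001, -0.001], [-0.001, -0.001]]
db = [0.01, 0.01]
kappa = 1197.0  # kappa_inf(A), taken from Example 3.12

x = solve_2x2(A, b)
A_pert = [[a + d for a, d in zip(ra, rd)] for ra, rd in zip(A, dA)]
b_pert = [bi + di for bi, di in zip(b, db)]
x_pert = solve_2x2(A_pert, b_pert)
dx = [xp - xi for xp, xi in zip(x_pert, x)]

actual = max_norm(dx) / max_norm(x)
bound = perturbation_bound(kappa, row_norm(dA) / row_norm(A), max_norm(db) / max_norm(b))
print(actual, bound)
```

The observed relative change stays well below the bound, which is typical: (5) is a worst-case estimate.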


Rounding Errors Introduced by Gaussian Elimination 


Equation (5) was derived without reference to any numerical method for solving
the linear system; hence, this result represents a fundamental property of the
mathematical problem. Using a process known as backward error analysis (see
Wilkinson [1,2] or Atkinson [3]), it can be shown that the approximate solution
to the system $A\mathbf{x} = \mathbf{b}$, obtained by applying Gaussian
elimination with pivoting in $t$-digit decimal floating point arithmetic, is the
exact solution of the perturbed system
$(A + \delta A)\tilde{\mathbf{x}} = \mathbf{b}$, where

\[ \frac{\|\delta A\|_\infty}{\|A\|_\infty} \le f(n) \, 10^{1-t} \,
   \frac{\max_{i,j,k} |a_{ij}^{(k)}|}{\|A\|_\infty}. \]

Here, $n$ denotes the size of the system, and the $a_{ij}^{(k)}$ are the
elements in the coefficient matrix during the $k$th elimination pass. In
practice, $f(n) \approx n$, and $f(n) \le 1.01(n^3 + 3n^2)$ in the worst case.
Wilkinson has observed that the empirical bound

\[ \frac{\|\delta A\|_\infty}{\|A\|_\infty} \le n \cdot 10^{1-t} \]

is seldom exceeded when pivoting is used. Hence, the rounding errors introduced
by Gaussian elimination would be bounded by

\[ \frac{\|\delta\mathbf{x}\|_\infty}{\|\mathbf{x}\|_\infty} \le
   \frac{\kappa_\infty(A) \cdot n \cdot 10^{1-t}}{1 - \kappa_\infty(A) \cdot n \cdot 10^{1-t}}. \]

From here, it is clear that to obtain a "good" solution we must have
$\kappa_\infty(A)$ much smaller than $10^{t-1}/n$, so that the product
$\kappa_\infty(A) \cdot n \cdot 10^{1-t}$ is much smaller than 1. Further, if
$\kappa_\infty(A) = 10^r$ for some $r > 0$, then we can expect to lose roughly
$r$ decimal digits of precision in computing an approximate solution.


Summary 


We've discovered that the condition number of a matrix $A$, $\kappa(A)$, is
central to the error analysis of the linear system $A\mathbf{x} = \mathbf{b}$.
From a numerical analysis standpoint, the order of magnitude of $\kappa(A)$
provides an indicator for the number of significant decimal digits that likely
will be lost when computing a solution of $A\mathbf{x} = \mathbf{b}$ using
Gaussian elimination. More fundamentally, the condition number measures the
sensitivity of the exact solution of $A\mathbf{x} = \mathbf{b}$ to changes in
the coefficient matrix and the right-hand side vector. The larger the condition
number, the more sensitive the solution.

Recall that polynomials whose roots are sensitive to changes in the coefficients
are called ill conditioned. Similarly, a matrix with a large condition number is
said to be ill conditioned. But what constitutes a large condition number? The
answer depends on the floating point number system being used. Specifically,
the order of magnitude of the condition number needs to be compared to the
number of significant digits available in the given system. Thus, a condition
number of $10^8$ would be large in IEEE standard single precision, which
provides only 7 decimal digits of accuracy. However, in IEEE standard double
precision, which provides 16 decimal digits of accuracy, a condition number of
$10^8$ would not be considered large.
EXERCISES
1. Let $A$ and $B$ be $n \times n$ matrices, and let $\alpha$ be a nonzero real number.
(a) Show that $\kappa(AB) \le \kappa(A)\kappa(B)$.
(b) Show that $\kappa(\alpha A) = \kappa(A)$.
2. Let $A$ be an $n \times n$ matrix, and suppose that $A\mathbf{x} = \mathbf{y}$
for some vectors $\mathbf{x}$ and $\mathbf{y}$. Show that
\[ \kappa(A) \ge \frac{\|A\| \, \|\mathbf{x}\|}{\|\mathbf{y}\|}. \]


3. (a) Let $A$ be a nonsingular matrix. Show that if

\[ \|A - B\| < 1 / \|A^{-1}\|, \]

then $B$ is nonsingular. (Hint: Write
$B = A - (A - B) = A(I - A^{-1}(A - B))$, and focus on the matrix
$A^{-1}(A - B)$. You will need to use Exercise 10 from Section 3.3.)

(b) Let $A$ be a nonsingular matrix and suppose that
$\|\delta A\| < 1 / \|A^{-1}\|$. Show that $A + \delta A$ is nonsingular.


4. For each of the following floating point number systems, what is roughly the
largest condition number for which the solution to the system
$A\mathbf{x} = \mathbf{b}$, computed in that number system using Gaussian
elimination with pivoting, would be accurate to ten (10) decimal digits? See
Section 1.3 for an explanation of the notation.

(a) IEEE standard double precision, F(2, 53, —1021, 1024) 

(b) Intel extended precision, F(2,64, —16381, 16384) 

(c) HP double extended precision, F(2, 113, -16381, 16384) 

(d) IBM System/390 long precision, F(16, 14, —64, 63) 

(e) IBM System/390 extended precision, F(16, 28, —64, 63) 


. Suppose the matrix A has a condition number of #& 10°. If the system of equa- 


tions Ax = b is solved using Gaussian elimination with pivoting in each of the 
following floating point number systems, how many decimal digits of precision 
can be expected in the approximate solution? 

(a) IEEE standard double precision, F(2, 53, —1021, 1024) 

(b) Intel extended precision, F(2, 64, —16381, 16384) 

(c) HP double extended precision, F(2, 113, —16381, 16384) 

(d) IBM System/390 long precision, F(16, 14, —64, 63} 

(e) IBM System/390 extended precision, F(16, 28, -64, 63) 
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(f) IEEE standard single precision, F(2, 24, —125, 128) 
(g) IBM System/390 short precision, F(16, 6, —64, 63) 


6. Repeat Exercise 5 if the matrix A has a condition number of + 10), 


7. Compute $\kappa_\infty$ for each of the following matrices.


1 2 
(a) A=| 1.001 | 


2.01 1.99 
(b) A= 1.99 2.01 


1 -1l -1 
@ ax[ 0 1 a 


(dyed 


oleate 
Ina] erale = 
CY [cop 


8. In each of the following problems, a linear system $A\mathbf{x} = \mathbf{b}$
is given, along with the exact solution, $\mathbf{x}$, and an approximate
solution, $\tilde{\mathbf{x}}$. Compute the error
$\mathbf{e} = \tilde{\mathbf{x}} - \mathbf{x}$ and the residual
$\mathbf{r} = A\tilde{\mathbf{x}} - \mathbf{b}$ and then compare the relative
error to the condition number times the relative residual. Use the $l_\infty$
norm in all cases. Note that the coefficient matrices in these problems are the
same matrices from Exercise 7.


(a) | a ; Jx- | oa | 


x=[1 1]? 

x=[3 0]? 

2.01 1.99 4 
o) | 3S aoe P= || 

x=[11]° 

x=[2 0]* 

is tt 0 
© |¢ 1 1 fw | 2) 

0 0 1 0 

=o 2 0]? 

z=[19 21 -01]7 

die shed 1 

a | ene tae 
a 

38 4 =5 30 

x=[1 -2 3] 

x=[ 102 -1.96 2.94 7 
Let 


10. 


ll. 


12. 
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{a) Compute Koo(A). 
(b) Let b= [ 02 1 4 lies and solve the system Ax = b. Now perturb b by 


6b = [ 0.01 0.01 0.01 a and solve the resulting perturbed system. 
Compare the actual value of |]5x||oo/||xlleo with the theoretical upper bound 
predicted by equation (5). 


(c) Repeat part (b), but start with b= [5.5 3.25 2.35 ]7. 
Let 
25 19 
a=|h a 
{a) Compute Keo (A). 


{b) Let b = [ 6 5 as and solve the system Ax = b. Now perturb b by 


éb= [ 0.01 —0.01 i and solve the resulting perturbed system. Compare 
the actual value of ||6x||oo/||X||oo with the theoretical upper bound predicted 
by equation (5). 


(c) Repeat part (b), but start with b = [ 1] |’. 
Let 
0.25 0.385 0.15 0.60 
A= | 0.20 0.20 0.25 and b=] 0.90 |. 
0.15 0.20 0.25 0.70 


(a) Compute ko (A). 
(b) Solve the system Ax = b. 
(c) Perturb the coefficient matrix and right-side vector by 


0.01 0 0 0.01 
0 0 —0.01 and vo=| 0.02 


0 -0.01 0 —0.03 


6A = 


and solve the resulting perturbed system. Compare the actual value of 
||5x||c0/||xlloo with the theoretical upper bound predicted by equation (5). 
(d) Perturb the original coefficient matrix and right-side vector by 


0 —0.01 0.01 0.02 
—0.01 0.01 0 and db= | 0.01 


0.01 0 0.01 —0.01 


6A= 


and solve the resulting perturbed system. Compare the actual value of 
|6x||co/||xl]oo with the theoretical upper bound predicted by equation (5). 


Let 
_[5a 87 _ [9.48 
a=| 34 a and b= | tas | 


(a) Compute Keo(A). 
(b) Solve the system $A\mathbf{x} = \mathbf{b}$.
(c) Perturb the coefficient matrix and right-side vector by 


—0.001 0 


0.001 0 


tae —0.05 


es 


and solve the resulting perturbed system. Compare the actual value of 
\|5x||co/||X|loo with the theoretical upper bound predicted by equation (5). 


(d) Perturb the original coefficient matrix and right-side vector by 


_ [| 0.001 0.001 Pe 
tam | val and =| | 


and solve the resulting perturbed system. Compare the actual value of 
||6X||co/||X||co with the theoretical upper bound predicted by equation (5). 


13. Let $A$ be the $n \times n$ matrix whose entries are given by
$a_{ij} = 1/(i + j - 1)$ for $1 \le i, j \le n$.

(a) For $n = 5$, solve the system $A\mathbf{x} = \mathbf{b}$ using Gaussian
elimination with scaled partial pivoting in single precision arithmetic. Take
$\mathbf{b}$ as the vector that corresponds to an exact solution of $x_i = 1$
for each $i = 1, 2, 3, \ldots, n$. Estimate $\kappa(A)$ based on the results of
this experiment.

(b) Repeat part (a) with $n = 11$ and double precision arithmetic.


14. Solve the following system in single precision arithmetic.

\[ \begin{aligned}
 -149x_1 - 50x_2 - 154x_3 &= 353 \\
  537x_1 + 180x_2 + 546x_3 &= -1263 \\
  -27x_1 - 9x_2 - 25x_3 &= 61
\end{aligned} \]

Use Gaussian elimination with scaled partial pivoting. The exact solution for
this problem is $\mathbf{x} = [\,-1 \;\; -1 \;\; -1\,]^T$. Estimate the
condition number of the coefficient matrix based on the outcome of this
experiment.


15. Solve the following system in double precision arithmetic.

\[ \left[ \begin{array}{rrrrr|r}
   -9 &    11 &   -21 &     63 &   -252 &   -356 \\
   70 &   -69 &   141 &   -421 &   1684 &   2385 \\
 -575 &   575 & -1149 &   3451 & -13801 & -19551 \\
 3891 & -3891 &  7782 & -23345 &  93365 & 132274 \\
 1024 & -1024 &  2048 &  -6144 &  24572 &  34812
\end{array} \right] \]

Use Gaussian elimination with scaled partial pivoting. The exact solution for
this problem is $\mathbf{x} = [\,1 \;\; -1 \;\; 1 \;\; -1 \;\; 1\,]^T$. Estimate
the condition number of the coefficient matrix based on the outcome of this
experiment.
3.5 LU DECOMPOSITION


Suppose we need to solve several linear systems, all with the same coefficient matrix, 
but each with a different right-hand-side vector. If all of the right-hand-side vectors 
are known from the outset, we can place the coefficient matrix and all of the vectors 
into a large augmented matrix. Gaussian elimination with back substitution applied 
to this large augmented matrix would then produce a simultaneous solution to all 
of the systems. 

But what if the right-hand-side vectors are not all known from the outset? For 
example, the solution vector for one system may be the right-hand-side vector for 
the next system. Several methods developed later in the text will work in precisely 
this manner. Although it is the elements in the coefficient matrix which dictate the 
operations to perform during Gaussian elimination, these operations are also carried 
out on the right-hand-side vector. As a result, each time we change the right-hand- 
side vector, exactly the same sequence of operations has to be repeated on the new 
augmented matrix. That's $O(n^3)$ operations repeated again and again. From an
efficiency standpoint, it would be better to have a solution scheme that treats the 
coefficient matrix and the right-hand-side vector separately, thereby reducing the 
effort which must be expended when the right-hand-side vector is changed. The 
objective of this section is to develop such a scheme. 


LU Decomposition 


Suppose we needed to solve the following cubic equation: $x^3 - 5x^2 + 4x = 0$.
We would start by factoring the cubic polynomial into $x(x - 4)(x - 1)$ and then
reducing the original problem into, in this case, three simpler problems:
$x = 0$ or $x - 4 = 0$ or $x - 1 = 0$. Given the success of this approach, it is
natural to ask whether it is possible to factor a matrix in such a way that the
original problem $A\mathbf{x} = \mathbf{b}$ can be reduced to solving simpler
problems. The answer to this question is yes. Fortunately, the resulting
solution scheme will also provide us with an efficient scheme for handling
multiple right-hand sides.

What structure should we seek for the matrix factors in our factorization of
the coefficient matrix? In Section 3.1, we saw that a system of equations with
an upper triangular coefficient matrix is easily solved using back substitution.
In the solution of systems of linear equations, lower triangular matrices are
just as important as upper triangular matrices.


Definition. The matrix $L$ is called LOWER TRIANGULAR if all elements
above the main diagonal are zero; that is, if $l_{ij} = 0$ whenever $i < j$.


A system with a coefficient matrix that is lower triangular can also be easily 
solved. For lower triangular matrices the solution technique is known as forward 
substitution, which is identical to back substitution except that we work from the 
top of the matrix to the bottom. Based on these considerations, we will try to 
factor the original coefficient matrix into the product of a lower triangular matrix 
and an upper triangular matrix—in that order. 
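Forward substitution, as described above, can be sketched in a few lines of Python. This is a minimal illustration rather than the text's own algorithm listing; the function name and the sample lower triangular system are made up for the example, and a nonzero diagonal is assumed.

```python
def forward_substitution(L, b):
    """Solve L y = b for a lower triangular L with nonzero diagonal entries.

    Works from the top row to the bottom row: each unknown depends only on
    the unknowns already computed.
    """
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        s = sum(L[i][j] * y[j] for j in range(i))  # contribution of known unknowns
        y[i] = (b[i] - s) / L[i][i]
    return y

# Illustrative system (not from the text)
L = [[2.0, 0.0, 0.0],
     [1.0, 3.0, 0.0],
     [4.0, -1.0, 5.0]]
b = [2.0, 7.0, 7.0]
print(forward_substitution(L, b))
```

Like back substitution, the loop performs on the order of $n^2$ operations, which is what makes triangular factors attractive when many right-hand sides must be processed.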
Given a matrix $A$, a lower triangular matrix $L$ and an upper triangular
matrix $U$ for which $LU = A$ are said to form an LU decomposition of $A$. For
example, because

\[ \begin{bmatrix} 1 & 0 & 0 \\ 2 & 1 & 0 \\ 5 & 12 & 1 \end{bmatrix}
   \begin{bmatrix} 1 & 4 & 3 \\ 0 & -1 & 3 \\ 0 & 0 & -53 \end{bmatrix}
 = \begin{bmatrix} 1 & 4 & 3 \\ 2 & 7 & 9 \\ 5 & 8 & -2 \end{bmatrix}, \]

the matrices

\[ L = \begin{bmatrix} 1 & 0 & 0 \\ 2 & 1 & 0 \\ 5 & 12 & 1 \end{bmatrix} \quad \text{and} \quad
   U = \begin{bmatrix} 1 & 4 & 3 \\ 0 & -1 & 3 \\ 0 & 0 & -53 \end{bmatrix} \]

form an LU decomposition for the matrix

\[ A = \begin{bmatrix} 1 & 4 & 3 \\ 2 & 7 & 9 \\ 5 & 8 & -2 \end{bmatrix}. \]

Not every matrix has an LU decomposition (see Exercises 7, 8, and 9), but it
is possible to rearrange the rows of any nonsingular matrix so that the resulting
matrix does have an LU decomposition. The algorithm we develop below will
automatically perform the needed row interchanges.

When a matrix has an LU decomposition, that decomposition is not unique.
In fact, when the matrix $A$ has an LU decomposition, there are an infinite
number of different choices available for the matrices $L$ and $U$. This
situation should not be surprising considering that between them, the factor
matrices have $n^2 + n$ elements to be determined (each matrix has
$(n^2 + n)/2$ nonzero elements), but the matrix $A$ has only $n^2$ elements. The
problem of computing $L$ and $U$ is therefore underdetermined.

Though the LU decomposition process does not uniquely determine $L$ and $U$,
the pairs of matrices that form different decompositions of the same matrix are
related. For if $A = L_1 U_1 = L_2 U_2$, where $L_1$ and $L_2$ are lower
triangular matrices and $U_1$ and $U_2$ are upper triangular matrices, then it
follows that

\[ L_2^{-1} L_1 = U_2 U_1^{-1}. \]

The matrix on the left-hand side of this equation is lower triangular, while the
matrix on the right-hand side is upper triangular. For these two matrices to be
equal, they must be equal to a diagonal matrix, call it $D$. Thus the matrices
that form different LU decompositions for the same matrix must be related by

\[ L_1 = L_2 D \quad \text{and} \quad U_2 = D U_1 \]

for some diagonal matrix $D$. Hence, we say an LU decomposition is unique up to
scaling by a diagonal matrix.
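EXAMPLE 3.14 Multiple LU Decompositions

We've already established that the matrix

\[ A = \begin{bmatrix} 1 & 4 & 3 \\ 2 & 7 & 9 \\ 5 & 8 & -2 \end{bmatrix} \]

has an LU decomposition that consists of the matrices

\[ L_1 = \begin{bmatrix} 1 & 0 & 0 \\ 2 & 1 & 0 \\ 5 & 12 & 1 \end{bmatrix} \quad \text{and} \quad
   U_1 = \begin{bmatrix} 1 & 4 & 3 \\ 0 & -1 & 3 \\ 0 & 0 & -53 \end{bmatrix}. \]

Another LU decomposition of $A$ consists of the pair

\[ L_2 = \begin{bmatrix} 1 & 0 & 0 \\ 2 & -1 & 0 \\ 5 & -12 & -53 \end{bmatrix} \quad \text{and} \quad
   U_2 = \begin{bmatrix} 1 & 4 & 3 \\ 0 & 1 & -3 \\ 0 & 0 & 1 \end{bmatrix}, \]

which can be verified directly by multiplication. In this case,
$U_1 = D U_2$ and $L_2 = L_1 D$, where the diagonal matrix

\[ D = \begin{bmatrix} 1 & 0 & 0 \\ 0 & -1 & 0 \\ 0 & 0 & -53 \end{bmatrix}. \]

Yet another LU decomposition of $A$ consists of the pair

\[ L_3 = \begin{bmatrix} 1/2 & 0 & 0 \\ 1 & -1/3 & 0 \\ 5/2 & -4 & 53 \end{bmatrix} \quad \text{and} \quad
   U_3 = \begin{bmatrix} 2 & 8 & 6 \\ 0 & 3 & -9 \\ 0 & 0 & -1 \end{bmatrix}. \]

Here, $U_3 = D U_2$ and $L_2 = L_3 D$, where the diagonal matrix

\[ D = \begin{bmatrix} 2 & 0 & 0 \\ 0 & 3 & 0 \\ 0 & 0 & -1 \end{bmatrix}. \]

What is the diagonal matrix which relates the pair $L_1$ and $U_1$ to the pair
$L_3$ and $U_3$?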
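The three factorizations in Example 3.14 can be verified by direct multiplication. The short Python sketch below is an added illustration; it uses exact rational arithmetic from the standard library so that the fractional entries of $L_3$ introduce no rounding, and the helper name `matmul` is an arbitrary choice.

```python
from fractions import Fraction as F

def matmul(X, Y):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

A = [[1, 4, 3], [2, 7, 9], [5, 8, -2]]

pairs = {
    "L1 U1": ([[1, 0, 0], [2, 1, 0], [5, 12, 1]],
              [[1, 4, 3], [0, -1, 3], [0, 0, -53]]),
    "L2 U2": ([[1, 0, 0], [2, -1, 0], [5, -12, -53]],
              [[1, 4, 3], [0, 1, -3], [0, 0, 1]]),
    "L3 U3": ([[F(1, 2), 0, 0], [1, F(-1, 3), 0], [F(5, 2), -4, 53]],
              [[2, 8, 6], [0, 3, -9], [0, 0, -1]]),
}

for name, (L, U) in pairs.items():
    # Each product should reproduce A exactly.
    print(name, matmul(L, U) == A)
```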
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Obtaining an LU Decomposition 


Different approaches can be taken for computing an LU decomposition. One
general approach is known as direct factorization. In this approach, we would
write out the $n^2$ equations for the $n^2 + n$ entries in $L$ and $U$ implied
by the matrix equation $LU = A$. These $n^2$ equations would then be
supplemented with $n$ auxiliary conditions (such as requiring all of the
diagonal entries of the matrix $U$ be equal to one) so that the problem will be
well defined. Finally, the calculations would be organized so that the system
could be solved as efficiently as possible. We will defer a detailed discussion
of direct factorization until the next section.

Here, we will focus on modifying Gaussian elimination to produce an LU 
decomposition. The key to making this modification is recognizing that Gaussian 
elimination can be represented as a sequence of matrix multiplications. To see how 
this comes about, consider the following matrix multiplication: 


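$$\begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 2 & 1 & 0 \\ 0 & 3 & 0 & 1 \end{bmatrix} \begin{bmatrix} a_1 & a_2 & a_3 & a_4 \\ b_1 & b_2 & b_3 & b_4 \\ c_1 & c_2 & c_3 & c_4 \\ d_1 & d_2 & d_3 & d_4 \end{bmatrix} = \begin{bmatrix} a_1 & a_2 & a_3 & a_4 \\ b_1 & b_2 & b_3 & b_4 \\ c_1 + 2b_1 & c_2 + 2b_2 & c_3 + 2b_3 & c_4 + 2b_4 \\ d_1 + 3b_1 & d_2 + 3b_2 & d_3 + 3b_3 & d_4 + 3b_4 \end{bmatrix}.$$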

Careful examination of the product matrix reveals that premultiplication by the
matrix

$$\begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 2 & 1 & 0 \\ 0 & 3 & 0 & 1 \end{bmatrix}$$

has carried out the elementary row operations $r_3 \leftarrow R_3 + 2R_2$ and $r_4 \leftarrow R_4 + 3R_2$.
Note that the multiples of row 2 that have been added to rows 3 and 4 are exactly
the entries in row 3, column 2 and row 4, column 2 of the premultiplying matrix,
respectively. Generalizing this result, it follows that the $i$th pass of Gaussian
elimination is equivalent to premultiplication of the coefficient matrix by the matrix $M_i$,
which is the identity matrix with the zero entries below the diagonal in the $i$th
column replaced by $m_{j,i}$ ($j = i+1, i+2, i+3, \ldots, n$). Recall that $m_{j,i}$ is just the
multiple of row $i$ needed to generate a zero in row $j$ of column $i$. Thus, assuming
that no row interchanges were necessary, the entire Gaussian elimination process is
given by $M_{n-1}M_{n-2}M_{n-3} \cdots M_3M_2M_1A = U$.

Each of the matrices $M_i$ is nonsingular, so we can solve the matrix
representation of Gaussian elimination for the matrix $A$, yielding
$A = M_1^{-1}M_2^{-1}M_3^{-1} \cdots M_{n-1}^{-1}U$. It is straightforward to show that $M_i^{-1}$ is
given by the identity matrix with the zero entries below the diagonal in the $i$th
column replaced by $-m_{j,i}$ ($j = i+1, i+2, i+3, \ldots, n$); see Exercise 3. Carrying out
the multiplication of the $M_i^{-1}$, we find that

$$M_1^{-1}M_2^{-1}M_3^{-1} \cdots M_{n-1}^{-1} = \begin{bmatrix} 1 & & & & & \\ -m_{2,1} & 1 & & & & \\ -m_{3,1} & -m_{3,2} & 1 & & & \\ -m_{4,1} & -m_{4,2} & -m_{4,3} & 1 & & \\ \vdots & \vdots & \vdots & & \ddots & \\ -m_{n,1} & -m_{n,2} & -m_{n,3} & \cdots & -m_{n,n-1} & 1 \end{bmatrix},$$

which is a lower triangular matrix. Note that because of its special structure, we
can obtain this matrix with no additional arithmetic.

Hence, the $U$ matrix produced by Gaussian elimination is the upper triangular
matrix of an LU decomposition, where the lower triangular matrix has ones
along the main diagonal and $-m_{j,i}$ in the locations below the diagonal. Since the
elements along the main diagonal of $L$ are always equal to 1, there is no need to
store these elements explicitly. We only need to record the $-m_{j,i}$, which can be
conveniently and efficiently done by overwriting the elements which are being set
to zero. This amounts to changing the line “set $a_{row,pass} = 0$” in the Gaussian
elimination algorithm to “set $a_{row,pass} = -m$.”
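The modification just described can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the algorithm as printed in the text: the names `lu_in_place` and `unpack` are hypothetical, indices are 0-based, no row interchanges are performed, and exact fractions are used so that the output can be compared with hand calculations.

```python
from fractions import Fraction

def lu_in_place(a):
    """Overwrite the square matrix a (list of rows) with its LU factors.

    On return, the upper triangle (including the diagonal) holds U and the
    strictly lower triangle holds the entries of L; the unit diagonal of L
    is implied.  No row interchanges are performed, so every pivot must be
    nonzero.
    """
    n = len(a)
    for p in range(n - 1):                  # p is the elimination pass
        if a[p][p] == 0:
            raise ZeroDivisionError("zero pivot; row interchanges are required")
        for row in range(p + 1, n):
            m = -a[row][p] / a[p][p]        # multiple of row p added to this row
            for col in range(p + 1, n):
                a[row][col] += m * a[p][col]
            a[row][p] = -m                  # store -m where the zero would go
    return a

def unpack(a):
    """Split the packed factors into explicit L and U matrices."""
    n = len(a)
    L = [[a[i][j] if j < i else (1 if i == j else 0) for j in range(n)]
         for i in range(n)]
    U = [[a[i][j] if j >= i else 0 for j in range(n)] for i in range(n)]
    return L, U

A = [[Fraction(v) for v in row] for row in [[1, 4, 3], [2, 7, 9], [5, 8, -2]]]
L, U = unpack(lu_in_place([row[:] for row in A]))
```

Applied to the matrix of Example 3.14, the sketch reproduces the first factor pair listed there, with ones on the diagonal of $L$.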

To this point our development has assumed that no row interchanges would 
be necessary. However, we know that row interchanges are sometimes necessary 
to avoid a zero pivot and are usually necessary as part of a pivoting strategy to 
reduce the effects of roundoff error. How do these row interchanges affect the LU 
decomposition process? Let’s do an example and find out. 


EXAMPLE 3.15 Determining an LU Decomposition 


Let’s determine an LU decomposition for the matrix 


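$$A = \begin{bmatrix} 1 & 4 & 3 \\ 2 & 7 & 9 \\ 5 & 8 & -2 \end{bmatrix}$$

using Gaussian elimination with scaled partial pivoting. The scale vector associated
with the matrix $A$ is given by

$$s = \begin{bmatrix} 4 & 9 & 8 \end{bmatrix},$$

and we initialize the row vector to

$$r = \begin{bmatrix} 1 & 2 & 3 \end{bmatrix}.$$

Examining the ratios

$$\frac{|a_{r_1,1}|}{s_{r_1}} = \frac{1}{4}, \quad \frac{|a_{r_2,1}|}{s_{r_2}} = \frac{2}{9}, \quad\text{and}\quad \frac{|a_{r_3,1}|}{s_{r_3}} = \frac{5}{8},$$

we find the largest value corresponds to row $r_3$, so we need to swap the first and
third elements in the row vector. Thus, for the first elimination pass, we have

$$r = \begin{bmatrix} 3 & 2 & 1 \end{bmatrix}.$$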
Following this elimination pass, the contents of the matrix are 


$$\begin{bmatrix} (1/5) & 12/5 & 17/5 \\ (2/5) & 19/5 & 49/5 \\ 5 & 8 & -2 \end{bmatrix}.$$


Note how the opposite of each multiplier overwrites the element which is being set 
to zero. To distinguish the multipliers from the other elements in the matrix, the 
multipliers are displayed within parentheses. 
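To determine the location of the next pivot, we examine the ratios

$$\frac{|a_{r_2,2}|}{s_{r_2}} = \frac{19/5}{9} = \frac{19}{45} \quad\text{and}\quad \frac{|a_{r_3,2}|}{s_{r_3}} = \frac{12/5}{4} = \frac{3}{5}.$$

The largest of these corresponds to row $r_3$, so we swap the second and third elements
in the row vector, which becomes

$$r = \begin{bmatrix} 3 & 1 & 2 \end{bmatrix}.$$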
Following the second, and final, elimination pass, the contents of the matrix are 


$$\begin{bmatrix} (1/5) & 12/5 & 17/5 \\ (2/5) & (19/12) & 265/60 \\ 5 & 8 & -2 \end{bmatrix}.$$


To identify the $L$ and $U$ matrices in the decomposition, we first need to read
the rows of the final matrix in the order indicated by the row vector; that is, start
with the third row, then the first row, and finally the second row. The upper
triangular matrix in the decomposition is then obtained by setting the elements
below the main diagonal to zero. The lower triangular matrix is obtained by setting
the elements along the main diagonal to 1 and the elements above the diagonal to
zero. Therefore, we find

$$L = \begin{bmatrix} 1 & 0 & 0 \\ 1/5 & 1 & 0 \\ 2/5 & 19/12 & 1 \end{bmatrix} \quad\text{and}\quad U = \begin{bmatrix} 5 & 8 & -2 \\ 0 & 12/5 & 17/5 \\ 0 & 0 & 265/60 \end{bmatrix}.$$

Note that if we multiply the matrices $L$ and $U$, we obtain

$$LU = \begin{bmatrix} 5 & 8 & -2 \\ 1 & 4 & 3 \\ 2 & 7 & 9 \end{bmatrix},$$


which is not equal to the matrix A. The rows of LU are the rows of A, but listed 
in a different order. Observe, in particular, that the rows of LU are the rows of A 
listed in the order indicated by the final row vector. 




To clarify the outcome of this last example, let

$$P = \begin{bmatrix} 0 & 0 & 1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix}.$$

This matrix was obtained by taking the $3 \times 3$ identity matrix and reordering the
rows according to the contents of the row vector $r = \begin{bmatrix} 3 & 1 & 2 \end{bmatrix}$. If we now
multiply $P$ into the matrix $A$, we obtain

$$PA = \begin{bmatrix} 5 & 8 & -2 \\ 1 & 4 & 3 \\ 2 & 7 & 9 \end{bmatrix},$$

which is equal to the product $LU$ calculated above. Hence, with row interchanges,
we have found an LU decomposition for the matrix $PA$.


A matrix such as $P$, which is an identity matrix with its rows reordered,
is called a permutation matrix. Thus, when row interchanges are used, the LU
decomposition we calculate will not be for the original matrix $A$, but will be for
the matrix $PA$, for some permutation matrix $P$. The specific permutation matrix
can be found by reordering the rows of the $n \times n$ identity matrix according to the
final contents of the row vector $r$.
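The factor step with scaled partial pivoting can be sketched as follows. This is one possible arrangement, not the text's own listing: the function name `lu_factor` is hypothetical, indices are 0-based (so the row vector $[3\ 1\ 2]$ appears as `[2, 0, 1]`), exact fractions are assumed for easy comparison with the worked example, and no row of the matrix is assumed to consist entirely of zeros.

```python
from fractions import Fraction

def lu_factor(a):
    """Factor step: LU decomposition with scaled partial pivoting.

    The matrix a (list of rows) is overwritten with the packed factors and
    the final row vector r (0-based) is returned.  Rows are never moved in
    memory; r[k] names the row that serves as the pivot row in pass k.
    """
    n = len(a)
    s = [max(abs(v) for v in row) for row in a]   # scale vector
    r = list(range(n))                            # row vector
    for p in range(n - 1):
        # pick the remaining row with the largest scaled pivot candidate
        best = max(range(p, n), key=lambda k: abs(a[r[k]][p]) / s[r[k]])
        r[p], r[best] = r[best], r[p]
        pivot = a[r[p]][p]
        if pivot == 0:
            raise ZeroDivisionError("matrix is singular")
        for k in range(p + 1, n):
            row = r[k]
            factor = a[row][p] / pivot            # this is -m in the text's notation
            for col in range(p + 1, n):
                a[row][col] -= factor * a[r[p]][col]
            a[row][p] = factor                    # overwrite the eliminated entry
    return r

A = [[Fraction(v) for v in row] for row in [[1, 4, 3], [2, 7, 9], [5, 8, -2]]]
r = lu_factor(A)
```

For the matrix of Example 3.15, `r` comes back as `[2, 0, 1]` and `A` holds the same packed contents that the example displays after the second pass. The permutation matrix is never formed; the row vector carries all of the information it contains.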


Solving a Linear System using an LU Decomposition 


Suppose we need to solve the linear system $Ax = b$, and we have already found a
lower triangular matrix $L$ and an upper triangular matrix $U$ such that $LU = PA$
for some permutation matrix $P$. If we multiply the linear system by $P$ and
then substitute $LU$ for $PA$, we find that solving the original system is equivalent
to solving $LUx = Pb$, or $L(Ux) = Pb$. Now, let $z = Ux$. This transforms
what had been one problem, solve $Ax = b$ for $x$, into a sequence of two
problems: Solve $Lz = Pb$ for $z$ and then solve $Ux = z$ for $x$. These two
subproblems, however, are easy to solve as a result of the structure we imposed on
the matrices $L$ and $U$. Forward substitution applied to $Lz = Pb$ produces the
vector $z$, and then back substitution applied to $Ux = z$ gives the solution
vector $x$.

It is important to note that to carry out this solution process, we don’t need 
to explicitly construct the permutation matrix P and form the matrix-vector prod- 
uct Pb. The matrix P is completely determined by the final contents of the row 
vector r generated during the LU decomposition process. Furthermore, multipli- 
cation by P merely rearranges the rows of b. Therefore, to carry out the solution 
process, we simply need to know the row vector r and access the rows of b through 
the row vector. 
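A sketch of the solve step shows how the row vector replaces the explicit product $Pb$. The function name `lu_solve` and the 0-based indexing are our own choices; the packed matrix and row vector below are the ones obtained in Example 3.15, and the right-hand side is chosen only for illustration.

```python
from fractions import Fraction as F

def lu_solve(lu, r, b):
    """Solve step: forward then back substitution on packed LU factors.

    lu : packed factors as produced by a factor step with row vector r
    r  : final row vector (0-based); r[i] is the storage row of logical row i
    b  : right-hand-side vector, accessed as b[r[i]] instead of forming Pb
    """
    n = len(b)
    z = [0] * n
    for i in range(n):                       # forward substitution, L z = P b
        z[i] = b[r[i]] - sum(lu[r[i]][j] * z[j] for j in range(i))
    x = [0] * n
    for i in reversed(range(n)):             # back substitution, U x = z
        tail = sum(lu[r[i]][j] * x[j] for j in range(i + 1, n))
        x[i] = (z[i] - tail) / lu[r[i]][i]
    return x

# Packed factors and row vector for A = [[1, 4, 3], [2, 7, 9], [5, 8, -2]].
lu = [[F(1, 5), F(12, 5), F(17, 5)],
      [F(2, 5), F(19, 12), F(265, 60)],
      [F(5), F(8), F(-2)]]
r = [2, 0, 1]

x = lu_solve(lu, r, [-4, -10, 9])
```

Here `x` equals `[3, -1, -1]`. Because the factors are untouched by the solve step, the same `lu` and `r` can be reused for any number of additional right-hand sides.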
EXAMPLE 3.16 Demonstration of Solution Process Based on LU Decomposition


Consider the linear system

$$\begin{bmatrix} 1 & 4 & 3 \\ 2 & 7 & 9 \\ 5 & 8 & -2 \end{bmatrix} x = \begin{bmatrix} -4 \\ -10 \\ 9 \end{bmatrix}.$$

Above, the LU decomposition process applied to the coefficient matrix for this
system produced the matrix

$$\begin{bmatrix} (1/5) & 12/5 & 17/5 \\ (2/5) & (19/12) & 265/60 \\ 5 & 8 & -2 \end{bmatrix}$$

and the row vector

$$r = \begin{bmatrix} 3 & 1 & 2 \end{bmatrix}.$$

Performing forward substitution, the intermediate vector $z$ is computed as
follows:

$$z_1 = b_{r_1} = b_3 = 9;$$

$$z_2 = b_{r_2} - l_{r_2,1}z_1 = b_1 - l_{1,1}z_1 = -4 - \frac{1}{5}(9) = -\frac{29}{5};$$

and

$$z_3 = b_{r_3} - l_{r_3,1}z_1 - l_{r_3,2}z_2 = b_2 - l_{2,1}z_1 - l_{2,2}z_2 = -10 - \frac{2}{5}(9) - \frac{19}{12}\left(-\frac{29}{5}\right) = -\frac{265}{60}.$$

The notation $l_{i,j}$ refers to the elements in the lower triangular matrix of the LU
decomposition, as stored in the matrix shown above.

Back substitution now determines the solution to the original system of equations:

$$x_3 = \frac{z_3}{u_{r_3,3}} = \frac{z_3}{u_{2,3}} = \frac{-265/60}{265/60} = -1;$$

$$x_2 = \frac{z_2 - u_{r_2,3}x_3}{u_{r_2,2}} = \frac{z_2 - u_{1,3}x_3}{u_{1,2}} = \frac{-29/5 - (17/5)(-1)}{12/5} = -1;$$

and

$$x_1 = \frac{z_1 - u_{r_1,2}x_2 - u_{r_1,3}x_3}{u_{r_1,1}} = \frac{z_1 - u_{3,2}x_2 - u_{3,3}x_3}{u_{3,1}} = \frac{9 - 8(-1) - (-2)(-1)}{5} = 3.$$

Hence, $x = \begin{bmatrix} 3 & -1 & -1 \end{bmatrix}^T$.
Summary and Comparison


In this section, we’ve developed a two-step algorithm for solving a linear system of 
equations. The first step, known as the factor step, determines an LU decomposi- 
tion for the coefficient matrix of the system. The coefficient matrix is the only input, 
and the output consists of the LU decomposition overwritten onto the original con- 
tents of the coefficient matrix and a row vector which indicates the final ordering of 
the rows. The second step of the algorithm, known as the solve step, takes the LU 
decomposition, the row vector, and the right-hand-side vector as inputs and then 
performs forward substitution followed by back substitution to produce a solution 
vector. 

For a linear system with $n$ equations in $n$ unknowns, the factor step has a
computational cost of $\frac{2}{3}n^3 - \frac{1}{2}n^2 - \frac{1}{6}n$ arithmetic operations. This is slightly lower
than the cost of Gaussian elimination because we've moved the processing of the
right-hand-side vector to the solve step. The solve step then requires $2n^2 - n$
arithmetic operations, which is slightly higher than the cost of back substitution.
Our two-step algorithm therefore has a total cost of $\frac{2}{3}n^3 + \frac{3}{2}n^2 - \frac{7}{6}n$ operations, which
is identical to the cost of solving a single linear system using Gaussian elimination
with back substitution.

What if we have multiple systems, all of which have the same coefficient
matrix? Specifically, suppose we need to solve $m$ systems of $n$ equations in $n$
unknowns. Using the factor and solve algorithm, the factor step would be performed
once, at a cost of $\frac{2}{3}n^3 - \frac{1}{2}n^2 - \frac{1}{6}n$ operations. The solve step would then be repeated
for each of the $m$ right-hand-side vectors, at a cost of $2mn^2 - mn$ operations. Thus,
the total cost for solving all $m$ systems is $\frac{2}{3}n^3 + \left(2m - \frac{1}{2}\right)n^2 - \left(m + \frac{1}{6}\right)n$ arithmetic
operations.

If all $m$ right-hand-side vectors are known from the outset, we can construct
an $n \times (n+m)$ augmented matrix and perform simultaneous Gaussian elimination
with back substitution. The computational cost of this algorithm is identical to
that of the factor and solve algorithm (see Exercise 2). However, if the
right-hand-side vectors are not all known from the outset, then performing Gaussian
elimination with back substitution sequentially on all $m$ systems incurs a cost of
$\frac{2}{3}mn^3 + \frac{3}{2}mn^2 - \frac{7}{6}mn$, which is substantially higher than the cost of the factor and
solve algorithm. Thus, in this case, the two-step algorithm is superior to Gaussian
elimination with back substitution.
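To get a feel for the size of the savings, the operation counts can be evaluated for specific values of $n$ and $m$. The sketch below simply codes the formulas $\frac{2}{3}n^3 - \frac{1}{2}n^2 - \frac{1}{6}n$ (factor), $2n^2 - n$ (solve), and $\frac{2}{3}n^3 + \frac{3}{2}n^2 - \frac{7}{6}n$ (one full elimination with back substitution); the function names and the sample values $n = 100$, $m = 10$ are our own choices.

```python
from fractions import Fraction as F

def factor_cost(n):
    """Arithmetic operations in the factor step."""
    return F(2, 3) * n**3 - F(1, 2) * n**2 - F(1, 6) * n

def solve_cost(n):
    """Arithmetic operations in one solve step."""
    return 2 * n**2 - n

def factor_and_solve_cost(n, m):
    """Factor once, then solve for m right-hand sides."""
    return factor_cost(n) + m * solve_cost(n)

def repeated_elimination_cost(n, m):
    """Gaussian elimination with back substitution repeated m times."""
    return m * (F(2, 3) * n**3 + F(3, 2) * n**2 - F(7, 6) * n)

n, m = 100, 10
two_step = factor_and_solve_cost(n, m)       # 860650 operations
repeated = repeated_elimination_cost(n, m)   # 6815500 operations
```

For these values, repeating the full elimination costs roughly eight times as much as factoring once and reusing the factors. With $m = 1$ the two counts agree, as the text states.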

Continuing to consider the case of multiple linear systems, all with the same
coefficient matrix, what about computing $A^{-1}$ followed by $m$ matrix-vector
multiplications as a solution algorithm? Each multiplication $A^{-1}b$ has a cost of $2n^2 - n$
operations, which is identical to the cost of the solve step. However, computing
$A^{-1}$ is nearly three times more expensive than determining an LU decomposition.
We can therefore conclude that when we have multiple systems, all with the same
coefficient matrix, and the right-hand-side vectors are not all known in advance,
the factor and solve algorithm is the most efficient solution scheme.
An Application Problem: The Inverse Power Method


In Chapter 4, we will study techniques for approximating the eigenvalues and
eigenvectors of an arbitrary matrix. One of the techniques we will study is called the
inverse power method. This is an iterative technique that requires an initial
estimate for an eigenvalue, $\lambda_0$, and a nonzero vector, $x^{(0)}$, as input. In each iteration,
the following calculations are made:

$$x^{(m)} = (A - \lambda_0 I)^{-1}x^{(m-1)}$$

$$\lambda_m = x^{(m)}_{p_{m-1}}$$

$$x^{(m)} = x^{(m)} / x^{(m)}_{p_m}.$$

The quantity $\lambda_0 + (1/\lambda_m)$ converges toward the eigenvalue of $A$ that is closest to $\lambda_0$,
and $x^{(m)}$ converges toward a corresponding eigenvector. The integer $p_m$ is chosen
so that $|x^{(m)}_{p_m}| = \|x^{(m)}\|_\infty$. In implementing this algorithm, we will not compute
$(A - \lambda_0 I)^{-1}$; it is more efficient to solve the system $(A - \lambda_0 I)x^{(m)} = x^{(m-1)}$ for
$x^{(m)}$. Since the matrix $A - \lambda_0 I$ does not change from iteration to iteration, we can
perform an LU decomposition once and use it in each iteration.

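As an illustration, let

$$A = \begin{bmatrix} 1 & -1 & 0 \\ -2 & 4 & -2 \\ 0 & -1 & 2 \end{bmatrix}.$$

This matrix has an eigenvalue near 5. (To ten decimal places, the eigenvalue is
5.1248854198.) Take $\lambda_0 = 5$, so that

$$A - \lambda_0 I = \begin{bmatrix} -4 & -1 & 0 \\ -2 & -1 & -2 \\ 0 & -1 & -3 \end{bmatrix},$$

and let $x^{(0)} = \begin{bmatrix} 1 & -4 & 1 \end{bmatrix}^T$. With this vector, note that $p_0 = 2$. An LU
decomposition for $A - \lambda_0 I$ is

$$L = \begin{bmatrix} 1 & 0 & 0 \\ 1/2 & 1 & 0 \\ 0 & 2 & 1 \end{bmatrix} \quad\text{and}\quad U = \begin{bmatrix} -4 & -1 & 0 \\ 0 & -1/2 & -2 \\ 0 & 0 & 1 \end{bmatrix}.$$

For the first iteration of the inverse power method, forward substitution applied
to $Lz = x^{(0)}$ gives $z = \begin{bmatrix} 1 & -9/2 & 10 \end{bmatrix}^T$. Back substitution on $Ux^{(1)} = z$
then gives $x^{(1)} = \begin{bmatrix} 7.5 & -31 & 10 \end{bmatrix}^T$. With $p_0 = 2$, we find $\lambda_1 = x^{(1)}_2 = -31$ or
$\lambda_0 + 1/\lambda_1 \approx 4.9677$. Finally, since $p_1 = 2$, we set

$$x^{(1)} = x^{(1)} / x^{(1)}_2 = \begin{bmatrix} -15/62 & 1 & -10/31 \end{bmatrix}^T$$

to prepare for the next iteration.

Forward substitution applied to $Lz = x^{(1)}$ gives

$$z = (1/124)\begin{bmatrix} -30 & 139 & -318 \end{bmatrix}^T,$$

which leads to $x^{(2)} = (1/124)\begin{bmatrix} -241 & 994 & -318 \end{bmatrix}^T$ when back substitution is
applied to $Ux^{(2)} = z$. It follows that $\lambda_2 = x^{(2)}_2 = 994/124$ and $\lambda_0 + 1/\lambda_2 \approx 5.1247$,
which is already correct to three decimal places. The approximate eigenvector is

$$x^{(2)} = x^{(2)} / x^{(2)}_2 = \begin{bmatrix} -241/994 & 1 & -318/994 \end{bmatrix}^T.$$

We leave it as an exercise to perform the next couple of iterations.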
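The iterations above follow a fixed pattern that is easy to automate. The following sketch is one way to organize the computation, with the factorization done once outside the loop. The function names are hypothetical, the LU step uses no row interchanges (matching the factors shown in the illustration, which is an assumption that would fail on a zero pivot), and exact fractions are used so the numbers can be compared with the hand calculation.

```python
from fractions import Fraction as F

def lu_nopivot(a):
    """Return packed LU factors of a (unit lower triangle, no row interchanges)."""
    n = len(a)
    lu = [row[:] for row in a]
    for p in range(n - 1):
        for i in range(p + 1, n):
            lu[i][p] = lu[i][p] / lu[p][p]
            for j in range(p + 1, n):
                lu[i][j] -= lu[i][p] * lu[p][j]
    return lu

def lu_solve(lu, b):
    """Forward substitution (unit L) followed by back substitution."""
    n = len(b)
    z = [0] * n
    for i in range(n):
        z[i] = b[i] - sum(lu[i][j] * z[j] for j in range(i))
    x = [0] * n
    for i in reversed(range(n)):
        x[i] = (z[i] - sum(lu[i][j] * x[j] for j in range(i + 1, n))) / lu[i][i]
    return x

def inverse_power(a, shift, x0, iterations):
    """Return the eigenvalue estimates shift + 1/lambda_m and the final vector."""
    n = len(a)
    shifted = [[F(a[i][j]) - (shift if i == j else 0) for j in range(n)]
               for i in range(n)]
    lu = lu_nopivot(shifted)                        # factor once
    x = [F(v) for v in x0]
    p = max(range(n), key=lambda i: abs(x[i]))      # index of the largest component
    estimates = []
    for _ in range(iterations):
        y = lu_solve(lu, x)                         # reuse the factors
        lam = y[p]                                  # lambda_m uses the previous index
        estimates.append(shift + 1 / lam)
        p = max(range(n), key=lambda i: abs(y[i]))
        x = [v / y[p] for v in y]                   # normalize by the largest component
    return estimates, x

A = [[1, -1, 0], [-2, 4, -2], [0, -1, 2]]
estimates, x = inverse_power(A, 5, [1, -4, 1], 2)
```

Two iterations give the estimates $154/31 \approx 4.9677$ and $2547/497 \approx 5.1247$, in agreement with the values found by hand. Increasing the last argument carries the iteration further.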


EXERCISES 


1. (a) Show that the algorithm to obtain an LU decomposition based on Gaussian
elimination requires $\frac{2}{3}n^3 - \frac{1}{2}n^2 - \frac{1}{6}n$ arithmetic operations.

(b) Show that the solve step (forward substitution followed by backward
substitution) requires $2n^2 - n$ arithmetic operations.

(c) Suppose $A^{-1}$ has been calculated. Show that the multiplication $A^{-1}b$
requires $2n^2 - n$ arithmetic operations.


2. Let $A$ be an $n \times n$ matrix, and suppose that we need to solve $m$ linear systems
$Ax = b_i$ for $i = 1, 2, 3, \ldots, m$. Consider constructing an $n \times (n+m)$ augmented
matrix that contains all of the right-hand-side vectors and performing Gaussian
elimination with back substitution on this matrix. Show that this algorithm
requires $\frac{2}{3}n^3 + \left(2m - \frac{1}{2}\right)n^2 - \left(m + \frac{1}{6}\right)n$ arithmetic operations.


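3. Show that

$$\begin{bmatrix} 1 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 \\ 0 & m_{3,2} & 1 & 0 & 0 \\ 0 & m_{4,2} & 0 & 1 & 0 \\ 0 & m_{5,2} & 0 & 0 & 1 \end{bmatrix}^{-1} = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 \\ 0 & -m_{3,2} & 1 & 0 & 0 \\ 0 & -m_{4,2} & 0 & 1 & 0 \\ 0 & -m_{5,2} & 0 & 0 & 1 \end{bmatrix}.$$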
4. Let


aie ae 
ae | 3 4 
Verify that each of the following pairs forms an LU decomposition of $A$, and
then use the decomposition to solve the system $Ax = \begin{bmatrix} 4 & 6 \end{bmatrix}^T$.


@o=[3o|.a=[5 2) 


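5. Let

$$A = \begin{bmatrix} 2 & 7 & 5 \\ 6 & 20 & 10 \\ 4 & 3 & 0 \end{bmatrix}.$$

Verify that each of the following pairs forms an LU decomposition of $A$, and
then use the decomposition to solve the system $Ax = \begin{bmatrix} 0 & 4 & 1 \end{bmatrix}^T$.

(a) $L_1 = \begin{bmatrix} 1 & 0 & 0 \\ 3 & 1 & 0 \\ 2 & 11 & 1 \end{bmatrix}, \quad U_1 = \begin{bmatrix} 2 & 7 & 5 \\ 0 & -1 & -5 \\ 0 & 0 & 45 \end{bmatrix}$

(b) $L_2 = \begin{bmatrix} 1 & 0 & 0 \\ 3 & -1 & 0 \\ 2 & -11 & 45 \end{bmatrix}, \quad U_2 = \begin{bmatrix} 2 & 7 & 5 \\ 0 & 1 & 5 \\ 0 & 0 & 1 \end{bmatrix}$

(c) $L_3 = \begin{bmatrix} 2 & 0 & 0 \\ 6 & -1 & 0 \\ 4 & -11 & 45 \end{bmatrix}, \quad U_3 = \begin{bmatrix} 1 & 7/2 & 5/2 \\ 0 & 1 & 5 \\ 0 & 0 & 1 \end{bmatrix}$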
6. Let
13 1 ~2 
2 4 -1 2 
AS 3k lL Bi) 
4 2 -1 6 


Verify that each of the following pairs forms an LU decomposition of $A$, and
then use the decomposition to solve the system $Ax = \begin{bmatrix} 3 & 7 & 10 & 11 \end{bmatrix}^T$.


1000 a ae ee 
“2, [oe tO i ae ee 
(ah Pa = |e 4 19 |e 748 |o. @ ao: 298 
a ae 00 0 -3 
10 0 06 131 -2 
2-1 0 0 023 -6 
UB) EL Pg. wed: ag 4.9). | ONO. 1 on 710 
4 SB 4G 238 00 0 4 


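7. (a) Show that the matrix

$$A = \begin{bmatrix} 0 & 1 \\ 1 & 1 \end{bmatrix}$$

has no LU decomposition. (Hint: Write out the equations corresponding to

$$\begin{bmatrix} l_{11} & 0 \\ l_{21} & l_{22} \end{bmatrix} \begin{bmatrix} u_{11} & u_{12} \\ 0 & u_{22} \end{bmatrix} = \begin{bmatrix} 0 & 1 \\ 1 & 1 \end{bmatrix}$$

and show that the resulting system is inconsistent.)


(b) Reverse the order of the rows of $A$ and show that the resulting matrix does
have an LU decomposition.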


8. (a) Show that the matrix 
has no LU decomposition.
(b) Rearrange the rows of A so that the resulting matrix does have an LU 
decomposition. 


9. Repeat Exercise 8 for the matrix 


1 1 
a=] 1 
-1 0 


10. Consider the matrix 


(a) Find a lower triangular matrix $L$ with ones along its diagonal and an upper
triangular matrix $U$ such that $A = LU$.


(b) Find matrices L, D and U such that A = LDU, where L is a lower triangular 
matrix with ones along its diagonal, D is a diagonal matrix, and U is an 
upper triangular matrix with ones along its diagonal. 

(c) Find a lower triangular matrix $L$ and an upper triangular matrix $U$ with
ones along its diagonal such that $A = LU$.


For Exercises 11-15, 


(a) Using scaled partial pivoting during the factor step, find matrices L, U and P 
such that LU = PA. 


(b) Solve the system Ax = b for each of the given right-hand-side vectors. 


ee ones ae 10 7 r 4 -2 
ee a 5 —5 -3 
11. A= 1 Sf 1 9 bi) = 3 ’ bz = 3 , b3 = 1 
-1 1 -1 5 4 | | —4 =8 
1 0 2 0 3] f -1 3 
-1 4 3 6 12 _ | -6 _ | -8 
TARR Qi: tom agy (eg |) BES sg llid: BES | suit Sal, 46 
3 1 1 0 5 | led 2 
13 1 -2 1 —5 5 
DAR 2 5 -3 | 5 
18.A=/3 1) 1 5 Bae] ay | RS | gs a SBS 
4 2 -1 9 —5 1 
2° 7% 35 14 -4 -8 
14,.A4=] 6 20 10] b,=| 36], be=|] -16], bz3=| -12 
4 3 90 7 -7 6 
13 39 2 587 28 53 57 
-4 ~-12 0 ~-19 -9 18 ~18 
15.A={ 3 0 -9 2 1 |b=] -7 |,bo=] —11 |, 
6 17 9 5 7 0 18 
$$b_3 = \begin{bmatrix} -145 \\ 49 \\ -27 \\ -4 \\ -286 \end{bmatrix}$$

16. In the text, the Inverse Power Method, a technique for approximating the
eigenvalues and eigenvectors for an arbitrary matrix, was described. Given an initial
estimate for the eigenvalue, $\lambda_0$, and a nonzero vector, $x^{(0)}$, the following sequence
of calculations is iterated:

$$x^{(m)} = (A - \lambda_0 I)^{-1}x^{(m-1)}$$

$$\lambda_m = x^{(m)}_{p_{m-1}}$$

$$x^{(m)} = x^{(m)} / x^{(m)}_{p_m}$$

The quantity $\lambda_0 + (1/\lambda_m)$ converges toward the eigenvalue of $A$ that is closest
to $\lambda_0$, and $x^{(m)}$ converges toward a corresponding eigenvector. The integer $p_m$
is chosen so that $|x^{(m)}_{p_m}| = \|x^{(m)}\|_\infty$.

For the remainder of this exercise, let

$$A = \begin{bmatrix} 1 & -1 & 0 \\ -2 & 4 & -2 \\ 0 & -1 & 2 \end{bmatrix}.$$

(a) For $\lambda_0 = 5$ and $x^{(0)} = \begin{bmatrix} 1 & -4 & 1 \end{bmatrix}^T$, we found that $\lambda_2 = 994/124$ and
$x^{(2)} = \begin{bmatrix} -241/994 & 1 & -318/994 \end{bmatrix}^T$. Perform the next two iterations.
How does the value $\lambda_0 + (1/\lambda_4)$ compare to the true eigenvalue 5.1248854198?


(b) For Ap = 2 and x! = [P-L 2 ie perform the first four iterations of 
the inverse power method. How does the value Ao + (1/A4) compare to the 
true eigenvalue 1.6366717621? 

17. Determine the member and reaction forces within the plane truss shown in
Figure 3.5 when the truss is subjected to each of the following loading
configurations.

(a) 500-pound forces directed vertically downward at nodes #3 and #5, and a 
1000-pound force directed vertically downward at node #4. 

(b) 500-pound force acting at node #3, a 1000-pound force acting at node #4, 
and a 1500-pound force acting at node #5, all forces acting vertically down- 
ward. 

(c) 1500-pound force acting at node #3, a 1000-pound force acting at node 
#4, and a 500-pound force acting at node #5, all forces acting vertically 
downward. 

(d) 500-pound force acting at node #4, and a 1000-pound force acting at node 
#3, both forces acting horizontally to the right. 

(e) 500-pound force acting at node #4 and a 1000-pound force acting at node
#5, both forces acting horizontally to the left.


Figure 3.5 Figure for Exercise 17. 


3.6 DIRECT FACTORIZATION 


In the previous section we saw how Gaussian elimination could be used for
determining the LU decomposition of a nonsingular matrix. Direct factorization is an
alternative procedure for obtaining an LU decomposition. Why do we need another
technique for calculating an LU decomposition? On the one hand, there are matrices
which have special structure to them. Direct factorization will make it possible
to construct schemes that take advantage of that structure. On the other hand,
the formulas associated with direct factorization will allow us, on some computers,
to take advantage of architecture to improve both speed and accuracy.


Direct Factorization 


Given a matrix $A$, recall that the objective of an LU decomposition is to determine
a lower triangular matrix $L$ and an upper triangular matrix $U$ such that $LU = A$.
This matrix equation is a shorthand for a system of $n^2$ equations, assuming that
$A$ is an $n \times n$ matrix, for the $n^2 + n$ nonzero entries in the matrices $L$ and $U$. To
produce a well-posed problem, we must specify $n$ additional equations.

Recall that the factors in an LU decomposition are determined only up to a
scaling by a diagonal matrix. Therefore, different factorizations may be viewed as
resulting from different choices for the diagonal elements of either $L$ or $U$. The two
most common choices for the diagonal entries are

$$l_{ii} = 1 \text{ for each } i = 1, 2, 3, \ldots, n; \text{ and}$$

$$u_{ii} = 1 \text{ for each } i = 1, 2, 3, \ldots, n,$$

which give rise to what are known as the Doolittle decomposition and the Crout
decomposition, respectively. When implementing a direct factorization algorithm,
the most important issue is the efficient organization of the calculations. To
demonstrate the proper sequence of calculations, let's focus on the Crout decomposition.
The Doolittle decomposition proceeds in a similar manner and will be considered
in the exercises.

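Let $A$ be an $n \times n$ matrix. To obtain the Crout decomposition of $A$ we must
determine the entries $l_{ij}$ ($i \ge j$) and $u_{ij}$ ($i < j$) such that

$$\begin{bmatrix} l_{11} & & & & \\ l_{21} & l_{22} & & & \\ l_{31} & l_{32} & l_{33} & & \\ \vdots & \vdots & \vdots & \ddots & \\ l_{n1} & l_{n2} & l_{n3} & \cdots & l_{nn} \end{bmatrix} \begin{bmatrix} 1 & u_{12} & u_{13} & \cdots & u_{1n} \\ & 1 & u_{23} & \cdots & u_{2n} \\ & & 1 & \cdots & u_{3n} \\ & & & \ddots & \vdots \\ & & & & 1 \end{bmatrix} = \begin{bmatrix} a_{11} & a_{12} & a_{13} & \cdots & a_{1n} \\ a_{21} & a_{22} & a_{23} & \cdots & a_{2n} \\ a_{31} & a_{32} & a_{33} & \cdots & a_{3n} \\ \vdots & \vdots & \vdots & & \vdots \\ a_{n1} & a_{n2} & a_{n3} & \cdots & a_{nn} \end{bmatrix}.$$

As with Gaussian elimination, we will organize our calculations into passes. In
total, an $n \times n$ matrix will require $n$ passes.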

For the first pass, note that the first column of $U$ contains a single nonzero
entry, a 1 in the first row. Therefore, the product of the $i$th row of $L$ (for $i = 1, 2,
3, \ldots, n$) with the first column of $U$ is simply the element $l_{i1}$. The decomposition
equation requires that this value be equated to $a_{i1}$; that is,

$$l_{i1} = a_{i1}.$$

These equations determine the first column of $L$. Now that the $l_{11}$ entry is known,
multiplying the first row of $L$ with the $j$th column of $U$ (for $j = 2, 3, 4, \ldots, n$) and
equating the result to $a_{1j}$ produces the equation $l_{11}u_{1j} = a_{1j}$. Dividing by $l_{11}$, we
find

$$u_{1j} = a_{1j}/l_{11},$$

thus determining the first row of $U$.

Each subsequent pass of the algorithm computes one more column of $L$ and
one more row of $U$. In particular, the $k$th column of $L$ and the $k$th row of $U$ are
determined during the $k$th pass. To compute the elements $l_{ik}$ (for $i = k, k+1,
k+2, \ldots, n$), form the product of the $i$th row of $L$ with the $k$th column of $U$, and
equate that to $a_{ik}$. The resulting equation is

$$l_{ik} + \sum_{j=1}^{k-1} l_{ij}u_{jk} = a_{ik},$$

whose solution for $l_{ik}$ is

$$l_{ik} = a_{ik} - \sum_{j=1}^{k-1} l_{ij}u_{jk}. \qquad (1)$$

With $l_{kk}$ now known, the elements $u_{kj}$ (for $j = k+1, k+2, k+3, \ldots, n$) are
found by equating the product of the $k$th row of $L$ and the $j$th column of $U$ to the
element $a_{kj}$. The formula for $u_{kj}$ is

$$u_{kj} = \frac{1}{l_{kk}} \left( a_{kj} - \sum_{i=1}^{k-1} l_{ki}u_{ij} \right). \qquad (2)$$

Note that for the final pass, only $l_{nn}$ is left to be determined, so the formula for
$u_{kj}$ does not need to be applied.
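Equations (1) and (2), organized into passes, translate almost line for line into code. The following is a minimal sketch, assuming 0-based indices, separate storage for $L$ and $U$ (rather than overwriting $A$), and no row interchanges; the function name `crout` is our own, and exact fractions are used so that the output can be checked against hand computation.

```python
from fractions import Fraction

def crout(a):
    """Return (L, U) with L lower triangular and U unit upper triangular."""
    n = len(a)
    L = [[0] * n for _ in range(n)]
    U = [[1 if i == j else 0 for j in range(n)] for i in range(n)]
    for k in range(n):                              # pass k
        for i in range(k, n):                       # column k of L, equation (1)
            L[i][k] = a[i][k] - sum(L[i][j] * U[j][k] for j in range(k))
        if L[k][k] == 0 and k < n - 1:
            raise ZeroDivisionError("zero pivot encountered")
        for j in range(k + 1, n):                   # row k of U, equation (2)
            U[k][j] = (a[k][j] - sum(L[k][i] * U[i][j] for i in range(k))) / L[k][k]
    return L, U

A = [[Fraction(v) for v in row] for row in [[1, 4, 3], [2, 7, 9], [5, 8, -2]]]
L, U = crout(A)
```

For this matrix the sketch returns $L$ with rows $(1, 0, 0)$, $(2, -1, 0)$, $(5, -12, -53)$ and $U$ with rows $(1, 4, 3)$, $(0, 1, -3)$, $(0, 0, 1)$. On the final pass the inner loop over $j$ is empty, which mirrors the remark that equation (2) is not needed there.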


EXAMPLE 3.17 Crout Decomposition in Action 


Consider the 3 x 3 matrix 


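$$A = \begin{bmatrix} 1 & 4 & 3 \\ 2 & 7 & 9 \\ 5 & 8 & -2 \end{bmatrix}.$$

The Crout decomposition of this matrix will consist of the matrices

$$L = \begin{bmatrix} l_{11} & 0 & 0 \\ l_{21} & l_{22} & 0 \\ l_{31} & l_{32} & l_{33} \end{bmatrix} \quad\text{and}\quad U = \begin{bmatrix} 1 & u_{12} & u_{13} \\ 0 & 1 & u_{23} \\ 0 & 0 & 1 \end{bmatrix}.$$

Since $A$ is a $3 \times 3$ matrix, it will take three passes to compute all of the entries
in $L$ and $U$. Following the general description provided above, in each pass, we will
compute the elements in one column of $L$ and in one row of $U$.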

Forming the product of each row of $L$ with the first column of $U$ and equating
the result with the corresponding element from $A$ determines the elements in the
first column of $L$:

$$l_{11} = 1, \quad l_{21} = 2, \quad\text{and}\quad l_{31} = 5.$$

The first row of $U$ is obtained by multiplying the first row of $L$ with the second and
third columns of $U$ and then equating the result with the corresponding element
from $A$. This yields the equations

$$l_{11}u_{12} = 4 \quad\text{and}\quad l_{11}u_{13} = 3,$$

whose solutions are

$$u_{12} = 4 \quad\text{and}\quad u_{13} = 3.$$

Thus, after the first pass, the Crout decomposition looks like

$$L = \begin{bmatrix} 1 & 0 & 0 \\ 2 & l_{22} & 0 \\ 5 & l_{32} & l_{33} \end{bmatrix} \quad\text{and}\quad U = \begin{bmatrix} 1 & 4 & 3 \\ 0 & 1 & u_{23} \\ 0 & 0 & 1 \end{bmatrix}.$$
For the second pass, we multiply the second and third rows of $L$ with the
second column of $U$. Equating each product with the corresponding element from $A$
generates the equations

$$l_{21}u_{12} + l_{22} = 7 \quad\text{and}\quad l_{31}u_{12} + l_{32} = 8.$$

Substituting the values determined during the first pass and solving for the elements
in the second column of $L$ gives

$$l_{22} = -1 \quad\text{and}\quad l_{32} = -12.$$

Next, we multiply the second row of $L$ into the third column of $U$ to derive the
equation

$$l_{21}u_{13} + l_{22}u_{23} = 9 \quad\text{or}\quad (2)(3) + (-1)u_{23} = 9.$$

Solving for $u_{23}$, we find $u_{23} = -3$. Thus, after the second pass, the Crout
decomposition looks like

$$L = \begin{bmatrix} 1 & 0 & 0 \\ 2 & -1 & 0 \\ 5 & -12 & l_{33} \end{bmatrix} \quad\text{and}\quad U = \begin{bmatrix} 1 & 4 & 3 \\ 0 & 1 & -3 \\ 0 & 0 & 1 \end{bmatrix}.$$

All that remains is the computation of $l_{33}$. Multiplying the third row of $L$
and the third column of $U$ generates the equation

$$l_{31}u_{13} + l_{32}u_{23} + l_{33} = -2.$$

Substituting the values determined from the previous passes, we find

$$l_{33} = -53.$$

Thus, the complete Crout decomposition is

$$L = \begin{bmatrix} 1 & 0 & 0 \\ 2 & -1 & 0 \\ 5 & -12 & -53 \end{bmatrix} \quad\text{and}\quad U = \begin{bmatrix} 1 & 4 & 3 \\ 0 & 1 & -3 \\ 0 & 0 & 1 \end{bmatrix}.$$

Observe that each element $a_{ij}$ from the original matrix appears in only one
factorization equation. In particular, $a_{ij}$ appears in the equation for the element $l_{ij}$
whenever $i \ge j$ and in the equation for the element $u_{ij}$ whenever $i < j$. Hence, as
each new factorization element $l_{ij}$ or $u_{ij}$ is computed, it can be stored in the location
previously occupied by $a_{ij}$. The same will be true for Doolittle decomposition.
Direct Factorization versus Decomposition via Gaussian Elimination


In terms of general performance, direct factorization and decomposition via
Gaussian elimination are identical. First, both schemes can be carried out “in place.”
By overwriting the entries in $A$, no additional storage is needed to carry out either
algorithm, an important benefit when the $A$ matrix is large. Second, both schemes
have exactly the same computational cost of

$$\frac{2}{3}n^3 - \frac{1}{2}n^2 - \frac{1}{6}n$$

arithmetic operations for an $n \times n$ matrix.

Direct factorization does offer certain advantages over Gaussian elimination
when the matrix has special structure. This issue will be explored in detail in the
next section. Furthermore, note that the direct factorization equations, (1) and (2),
for computing $l_{ik}$ and $u_{kj}$ involve inner products of vectors (i.e., multiplication of
a row vector into a column vector). These calculations can be carried out in a
separate routine that employs higher precision to accumulate the inner product
(thereby improving accuracy) and/or that exploits machine architecture, such as
vectorization (thereby improving speed).
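The benefit of isolating the inner product in its own routine can be seen in a small experiment. The sketch below is only an illustration of the idea: the function names are hypothetical, and Python's `math.fsum` stands in for a higher-precision accumulator (it sums without intermediate rounding, although each individual product is still rounded to working precision). The test vectors are contrived so that the effect is visible.

```python
import math

def dot_naive(u, v):
    """Inner product accumulated term by term in working precision."""
    total = 0.0
    for ui, vi in zip(u, v):
        total += ui * vi
    return total

def dot_accurate(u, v):
    """Inner product whose sum is accumulated without intermediate rounding."""
    return math.fsum(ui * vi for ui, vi in zip(u, v))

u = [1e16, 1.0, -1e16]
v = [1.0, 1.0, 1.0]
naive = dot_naive(u, v)        # the middle term is lost to roundoff
accurate = dot_accurate(u, v)  # the middle term survives
```

With these vectors the term-by-term sum returns 0.0 while the accurately accumulated sum returns 1.0. A direct factorization routine that calls a function like `dot_accurate` inside equations (1) and (2) inherits the improvement without any other change to its structure.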


EXERCISES 


In Exercises 1-6, determine the Crout decomposition of the given matrix, and then 
solve the system Ax = b for each of the given right-hand-side vectors. 


p2 7 5 0 -4 =3 
1.A4=|] 6 2 | bi=| 4], be=]| 16 |, w= | 2 
l4 3 0 1 -7 6 
Pelt ihe r 3 -~9 ty 
2,A4=| ~1 0 2 | bi = al bo = “|, w=] a 
bid 2 OSL 4 7 0 
p-3 2 — , 7 —12 17 
3 A = 6 8 1 | by = 3 ? be = 1 7 bg = | —-19 
ee ~33 1 ~35 
pol 4 5 Ff —15 —10 —21 
4.4=| 2 6 4| bi=| 14], be=| -10 | .- he || 214 
| -l1 -2 3 —7 —10 —17 
fol 2 3.464 10 —4 —2 
-l 1 2 3 5 _| 75 _| 73 
5. A = 1 —] 1 2 bi = 3 ’ be = 3 . bz = 1 
or 4 -4 -8 
13 1. +2 1 -9 5 
Pbeoh ide Take 2 = | =8 5 
6 A=|3 54 5 b=] 9], be=| @ |, b=] 2, 
4 2 -l 9 -5 1 
7. Show that computing the Crout decomposition of an $n \times n$ matrix requires
$\frac{2}{3}n^3 - \frac{1}{2}n^2 - \frac{1}{6}n$ arithmetic operations.

8. (a) Construct an algorithm to compute the Doolittle decomposition of an $n \times n$
matrix.

(b) Show that computing the Doolittle decomposition of an $n \times n$ matrix requires
$\frac{2}{3}n^3 - \frac{1}{2}n^2 - \frac{1}{6}n$ arithmetic operations.


In Exercises 9-14, determine the Doolittle decomposition (see Exercise 8) of the given
matrix, and then solve the system $Ax = b$ for each of the given right-hand-side vectors.


9. Use the matrix and right-hand-side vectors from Exercise 1.
10. Use the matrix and right-hand-side vectors from Exercise 2.
11. Use the matrix and right-hand-side vectors from Exercise 3.
12. Use the matrix and right-hand-side vectors from Exercise 4.
13. Use the matrix and right-hand-side vectors from Exercise 5.
14. Use the matrix and right-hand-side vectors from Exercise 6.


15. (a) Construct an algorithm to factor an $n \times n$ matrix into the product $LDU$,
where $L$ is a lower triangular matrix with ones along its diagonal, $D$ is a
diagonal matrix, and $U$ is an upper triangular matrix with ones along its
diagonal.

(b) Suppose the matrix $A$ has been factored into the product $LDU$, where the
matrices $L$, $D$, and $U$ have the form specified in part (a). Construct an
algorithm to use this factorization to solve the system $Ax = b$.

(c) How many arithmetic operations are required to compute the factorization
in part (a)? How does this total compare to the number of operations needed
to compute an LU decomposition?

(d) How many arithmetic operations are required by the algorithm in part (b)
to solve a system given an LDU decomposition of the coefficient matrix?
How does this total compare to the number of operations needed by forward
and backward substitution?

(e) How does the total number of arithmetic operations needed to solve a
system of equations using an LDU decomposition compare to the number of
operations needed to solve a system using an LU decomposition?


In Exercises 16-21, determine the LDU decomposition (see Exercise 15) of the given
matrix, and then solve the system $Ax = b$ for each of the given right-hand-side vectors.


16. Use the matrix and right-hand-side vectors from Exercise 1.
17. Use the matrix and right-hand-side vectors from Exercise 2.
18. Use the matrix and right-hand-side vectors from Exercise 3.
19. Use the matrix and right-hand-side vectors from Exercise 4.
20. Use the matrix and right-hand-side vectors from Exercise 5.
21. Use the matrix and right-hand-side vectors from Exercise 6.
3.7 SPECIAL MATRICES


Linear systems which arise in practice often have coefficient matrices that have 
special properties or structure. Additionally, many numerical methods involve the 
construction and solution of linear systems with coefficient matrices that have spe- 
cial properties or structure. In this section, we will discuss three important classes 
of special matrices which arise frequently. 


Strictly Diagonally Dominant Matrices 


The first class of special matrices that we will discuss is the strictly diagonally 
dominant matrices. 


Definition. An $n \times n$ matrix $A$ is STRICTLY DIAGONALLY DOMINANT if, for
each row, the magnitude of the diagonal element is strictly larger than the
sum of the magnitudes of the other elements on the row; that is, if, for each $i$,

$$|a_{ii}| > \sum_{\substack{j=1 \\ j \ne i}}^{n} |a_{ij}|.$$


As an illustration, consider the two 3 x 3 matrices 

$$\begin{bmatrix} 3 & -1 & 1 \\ 2 & -6 & 3 \\ -9 & 7 & -20 \end{bmatrix} \quad\text{and}\quad \begin{bmatrix} 3 & -2 & 2 \\ 2 & -6 & 3 \\ -9 & 7 & -20 \end{bmatrix}.$$

The first matrix is strictly diagonally dominant since 

$$|3| = 3 > 2 = |-1| + |1|,$$
$$|-6| = 6 > 5 = |2| + |3|, \text{ and}$$
$$|-20| = 20 > 16 = |-9| + |7|.$$

The second matrix, however, is not strictly diagonally dominant since in the very 
first row 

$$|3| = 3 < 4 = |-2| + |2|.$$
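
The row-by-row test in the definition is easy to automate. The following is a minimal sketch in Python; the function name `is_strictly_diagonally_dominant` and the nested-list storage of the matrix are illustrative choices, not part of the text.

```python
def is_strictly_diagonally_dominant(a):
    """Return True if |a[i][i]| exceeds the sum of |a[i][j]|, j != i, in every row."""
    for i, row in enumerate(a):
        off_diagonal = sum(abs(value) for j, value in enumerate(row) if j != i)
        if abs(row[i]) <= off_diagonal:
            return False  # row i violates the strict inequality
    return True


# The two 3 x 3 matrices used as illustrations above.
first = [[3, -1, 1], [2, -6, 3], [-9, 7, -20]]
second = [[3, -2, 2], [2, -6, 3], [-9, 7, -20]]

print(is_strictly_diagonally_dominant(first))   # True
print(is_strictly_diagonally_dominant(second))  # False
```

The test stops at the first offending row, which mirrors the argument used for the second matrix.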


The following theorem indicates three of the most important characteristics 
of strictly diagonally dominant matrices. 


Theorem. Let A be a strictly diagonally dominant matrix. Then 


1. A is nonsingular, so Ax = b has a unique solution for any right-hand-side 
vector b; 

2. Gaussian elimination and direct factorization can be performed on A 
without row interchanges; and 

3. The calculations during Gaussian elimination and direct factorization 
are stable with respect to the growth of roundoff error; that is, no pivoting 
strategy needs to be applied. 




Proof. (1) Let A be an n x n strictly diagonally dominant matrix, but suppose 
A is singular. Then there exists a nonzero vector x such that Ax = 0. Let i 
be an index for which $|x_i| = \|x\|_\infty$. Now, focus on the ith component of the 
vector equation Ax = 0, which reads 

$$\sum_{j=1}^{n} a_{ij}x_j = 0 \quad\text{or}\quad a_{ii}x_i + \sum_{\substack{j=1 \\ j \ne i}}^{n} a_{ij}x_j = 0.$$

Solving this last equation for $a_{ii}$ (note that $x_i \ne 0$ because x is nonzero), 
taking the absolute value and repeatedly applying the triangle inequality leads to 

$$|a_{ii}| \le \sum_{\substack{j=1 \\ j \ne i}}^{n} |a_{ij}| \frac{|x_j|}{|x_i|} \le \sum_{\substack{j=1 \\ j \ne i}}^{n} |a_{ij}|.$$

This violates the hypothesis that A is strictly diagonally dominant. Hence, A 
must be nonsingular. 

See Wendroff [1] for a proof of parts (2) and (3). □ 


This theorem has both theoretical and practical implications. On the theo- 
retical side, many of the techniques developed in this text require the solution of a 
linear system of equations. The coefficient matrices of some of these systems will 
be strictly diagonally dominant. When this happens, we will be guaranteed that 
the systems have a unique solution and that the underlying numerical methods are 
well defined. 

On the practical side, this theorem provides information to help us choose 
the most efficient solution technique available. If the coefficient matrix for a 
specific system of equations is strictly diagonally dominant, we don’t need a solution 
technique which uses sophisticated pivoting strategies. We don’t need any pivoting 
strategy at all. 


Symmetric Positive Definite Matrices 


The second class of special matrices that we will consider is the symmetric positive 
definite matrices. Recall that a matrix A is symmetric if $A^T = A$; that is, if 
$a_{ij} = a_{ji}$ for each i and j. 

Definition. A matrix A is SYMMETRIC POSITIVE DEFINITE if it is symmetric 
and $x^T A x > 0$ for any nonzero vector x. 

Consider the 3 x 3 matrix 

$$A_1 = \begin{bmatrix} 3 & 1 & -1 \\ 1 & 4 & 2 \\ -1 & 2 & 5 \end{bmatrix}.$$

By inspection, this matrix is symmetric. Let $x = [\, x_1 \;\; x_2 \;\; x_3 \,]^T$. Then 

$$x^T A_1 x = 3x_1^2 + 4x_2^2 + 5x_3^2 + 2x_1x_2 + 4x_2x_3 - 2x_1x_3$$
$$= x_1^2 + x_2^2 + 2x_3^2 + (x_1 + x_2)^2 + (x_1 - x_3)^2 + 2(x_2 + x_3)^2,$$

which is clearly greater than zero for any nonzero x. Therefore, $A_1$ is symmetric 
positive definite. On the other hand, the matrix 


$$A_2 = \begin{bmatrix} 3 & -2 & -1 \\ -2 & 3 & -1 \\ -1 & -1 & 2 \end{bmatrix}$$

is not symmetric positive definite. For this matrix, 

$$x^T A_2 x = 2(x_1 - x_2)^2 + (x_1 - x_3)^2 + (x_2 - x_3)^2,$$

which equals zero for any vector whose components satisfy $x_1 = x_2 = x_3$; for 
example, $x = [\, 1 \;\; 1 \;\; 1 \,]^T$ or $x = [\, -1 \;\; -1 \;\; -1 \,]^T$. 

It should be clear from these two examples that determining whether or not 
a given matrix is symmetric positive definite using the definition is a nontrivial 
task. Fortunately, there are simpler conditions that can be checked. We begin 
by considering a set of three necessary conditions. All symmetric positive definite 
matrices must satisfy each of these conditions. Therefore, any matrix which violates 
one of these conditions cannot be symmetric positive definite. 


Theorem. Let A be an n x n symmetric positive definite matrix. Then 
1. $a_{ii} > 0$ for each i = 1, 2, 3, ..., n; 
2. $\max_{1 \le k,j \le n} |a_{kj}| \le \max_{1 \le i \le n} |a_{ii}|$; and 
3. $a_{ij}^2 < a_{ii}a_{jj}$ for $i \ne j$. 


Proof. We will prove the second part of the theorem here and leave the first 
and third parts for the exercises. For $j \ne k$, define the vector x by 

$$x_i = \begin{cases} 1, & i = j \\ -1, & i = k \\ 0, & \text{otherwise.} \end{cases}$$

Since $x \ne 0$, it follows that 

$$x^T A x = a_{jj} - a_{jk} - a_{kj} + a_{kk} > 0,$$

or 

$$a_{kj} < \frac{a_{jj} + a_{kk}}{2}, \qquad (1)$$

where we have used the fact that $a_{jk} = a_{kj}$ since A is symmetric. Next, 
consider the vector x defined by 

$$x_i = \begin{cases} 1, & i = j \text{ or } i = k \\ 0, & \text{otherwise.} \end{cases}$$

Now, $x^T A x > 0$ is equivalent to 

$$a_{kj} > -\frac{a_{jj} + a_{kk}}{2}. \qquad (2)$$

Combining (1) and (2) yields 

$$|a_{kj}| < \frac{a_{jj} + a_{kk}}{2} \le \max_{1 \le i \le n} |a_{ii}|$$

whenever $j \ne k$. Hence, 

$$\max_{1 \le k,j \le n} |a_{kj}| \le \max_{1 \le i \le n} |a_{ii}|. \qquad □$$

EXAMPLE 3.18 Matrices that are not Symmetric Positive Definite 


None of the matrices 

$$\begin{bmatrix} 2 & 1 & 0 \\ 1 & -2 & 1 \\ 0 & 1 & 2 \end{bmatrix}, \quad \begin{bmatrix} 8 & -1 & 1 \\ -1 & 8 & 9 \\ 1 & 9 & 7 \end{bmatrix}, \quad\text{and}\quad \begin{bmatrix} 3 & 1 & 5 \\ 1 & 4 & 2 \\ 5 & 2 & 8 \end{bmatrix}$$

is symmetric positive definite. The first matrix cannot be symmetric positive 
definite because the diagonal element $a_{22} = -2$ is negative, which violates the first 
condition from the theorem. The second matrix cannot be symmetric positive 
definite because it violates the second condition. Note that 

$$\max_{1 \le k,j \le 3} |a_{kj}| = 9 > 8 = \max_{1 \le i \le 3} |a_{ii}|.$$

The final matrix cannot be symmetric positive definite because 

$$a_{13}^2 = 25 > 24 = a_{11}a_{33}.$$
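
The three necessary conditions can be screened mechanically before any more expensive test is attempted. The sketch below is one way to do this in Python; the function name `violated_spd_conditions` and the convention of returning the numbers of the violated conditions are assumptions made for illustration. Passing all three checks does not establish positive definiteness, since the conditions are only necessary.

```python
def violated_spd_conditions(a):
    """Return the numbers (1, 2, 3) of the necessary conditions that the
    symmetric matrix `a` (list of rows) fails to satisfy."""
    n = len(a)
    violated = []
    # Condition 1: every diagonal element is positive.
    if any(a[i][i] <= 0 for i in range(n)):
        violated.append(1)
    # Condition 2: the largest magnitude in the matrix occurs on the diagonal.
    largest_entry = max(abs(a[k][j]) for k in range(n) for j in range(n))
    largest_diagonal = max(abs(a[i][i]) for i in range(n))
    if largest_entry > largest_diagonal:
        violated.append(2)
    # Condition 3: a_ij^2 < a_ii * a_jj for all i != j.
    if any(a[i][j] ** 2 >= a[i][i] * a[j][j]
           for i in range(n) for j in range(n) if i != j):
        violated.append(3)
    return violated


# The three matrices of Example 3.18.
print(violated_spd_conditions([[2, 1, 0], [1, -2, 1], [0, 1, 2]]))   # [1, 3]
print(violated_spd_conditions([[8, -1, 1], [-1, 8, 9], [1, 9, 7]]))  # [2, 3]
print(violated_spd_conditions([[3, 1, 5], [1, 4, 2], [5, 2, 8]]))    # [3]
```

Note that a matrix may violate more than one condition; the example only cites the first violation found for each matrix.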


Next, we present a few results which allow us to show that a matrix is sym- 
metric positive definite. Proofs of the first two results will be deferred until the 
next chapter after we’ve developed some more theory regarding eigenvalues and 
eigenvectors. 


Theorem. If A is a symmetric matrix and all of its eigenvalues are positive, 
then A is symmetric positive definite. 


Corollary. If A is symmetric, strictly diagonally dominant and $a_{ii} > 0$ for 
each i, then A is symmetric positive definite. 


A matrix can also be identified as symmetric positive definite by examining its 
leading principal submatrices. 


Definition. For each k = 1, 2, 3, ..., n, the first k rows and the first k columns 
of the n x n matrix A form the kth LEADING PRINCIPAL SUBMATRIX of A. 


The connection between leading principal submatrices and symmetric positive 
definite matrices is given by the following theorem. For a proof, see Stewart [2]. 

Theorem. A symmetric matrix is symmetric positive definite if and only if 
each of its leading principal submatrices has positive determinant. 


EXAMPLE 3.19 Matrices that are Symmetric Positive Definite 


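Consider the matrices 

$$\begin{bmatrix} 6 & -2 & 3 \\ -2 & 8 & 1 \\ 3 & 1 & 7 \end{bmatrix} \quad\text{and}\quad \begin{bmatrix} 3 & -1 & 2 \\ -1 & 3 & 1 \\ 2 & 1 & 3 \end{bmatrix}.$$

The first matrix is symmetric, strictly diagonally dominant and each of its diagonal 
elements is positive. Hence, by the corollary given above, this matrix is symmetric 
positive definite. The second matrix is not strictly diagonally dominant, so the 
corollary does not apply. However, 

$$\det([\,3\,]) = 3 > 0, \quad \det\begin{bmatrix} 3 & -1 \\ -1 & 3 \end{bmatrix} = 8 > 0 \quad\text{and}\quad \det\begin{bmatrix} 3 & -1 & 2 \\ -1 & 3 & 1 \\ 2 & 1 & 3 \end{bmatrix} = 5 > 0,$$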


so all of the leading principal submatrices have positive determinant. Consequently, 
the second matrix is symmetric positive definite. 
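
The determinant test can be sketched in a few lines of Python. The helper names `determinant`, `leading_principal_minors` and `is_spd_by_minors` are illustrative, and the cofactor expansion used here is only sensible for the small matrices of the examples; it is not an efficient way to compute determinants in general.

```python
def determinant(m):
    """Determinant by cofactor expansion along the first row (small matrices only)."""
    if len(m) == 1:
        return m[0][0]
    total = 0
    for j, entry in enumerate(m[0]):
        minor = [row[:j] + row[j + 1:] for row in m[1:]]
        total += (-1) ** j * entry * determinant(minor)
    return total


def leading_principal_minors(a):
    """Determinants of the k x k leading principal submatrices, k = 1, ..., n."""
    return [determinant([row[:k] for row in a[:k]]) for k in range(1, len(a) + 1)]


def is_spd_by_minors(a):
    """Symmetric with all leading principal minors positive."""
    n = len(a)
    symmetric = all(a[i][j] == a[j][i] for i in range(n) for j in range(n))
    return symmetric and all(d > 0 for d in leading_principal_minors(a))


second = [[3, -1, 2], [-1, 3, 1], [2, 1, 3]]
print(leading_principal_minors(second))  # [3, 8, 5]
print(is_spd_by_minors(second))          # True
```

The printed minors reproduce the three determinants computed in Example 3.19.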


The following theorem summarizes the important properties of symmetric 
positive definite matrices relative to the problem of solving linear systems of 
equations. The proof of the first part is considered in Exercise 8. For a proof of the 
second and third parts, see Wendroff [1]. 


Theorem. Let A be a symmetric positive definite matrix. Then 


1. A is nonsingular, so Ax = b has a unique solution for any right-hand- 
side vector b; 

2. Gaussian elimination and direct factorization can be performed on A 
without row interchanges; and 

3. The calculations during Gaussian elimination and direct factorization 
are stable with respect to the growth of roundoff error; that is, no piv- 
oting strategy needs to be applied. 


The implications of this theorem are the same as those of the theorem 
presented earlier for strictly diagonally dominant matrices. From a theoretical 
standpoint, we will know that a numerical method which requires the solution of a linear 
system with a symmetric positive definite coefficient matrix is well-defined. From a 
practical standpoint, the theorem helps us select an efficient computational scheme. 
In particular, no pivoting strategy is necessary. 

When working with symmetric positive definite matrices, even greater 
efficiency can be obtained by taking into account the symmetry of the matrix. To 
do this, rather than factor the matrix into LU form, we factor the matrix into the 
form 

$$A = LL^T = \begin{bmatrix} l_{11} \\ l_{21} & l_{22} \\ l_{31} & l_{32} & l_{33} \\ \vdots & \vdots & \vdots & \ddots \\ l_{n1} & l_{n2} & l_{n3} & \cdots & l_{nn} \end{bmatrix} \begin{bmatrix} l_{11} & l_{21} & l_{31} & \cdots & l_{n1} \\ & l_{22} & l_{32} & \cdots & l_{n2} \\ & & l_{33} & \cdots & l_{n3} \\ & & & \ddots & \vdots \\ & & & & l_{nn} \end{bmatrix}.$$

This produces what is known as the Cholesky decomposition. The equations for the 
Cholesky decomposition are developed in the same manner as were the equations for 
the Crout decomposition: We make passes through the matrix, each time computing 
the elements in one column of L. In the first pass, we find 

$$l_{11} = \sqrt{a_{11}} \quad\text{and}\quad l_{i1} = a_{i1}/l_{11} \quad (\text{for } i = 2, 3, 4, \ldots, n).$$

For k = 2, 3, 4, ..., n-1, the diagonal element of the kth column is given by 

$$l_{kk} = \sqrt{a_{kk} - \sum_{j=1}^{k-1} l_{kj}^2},$$

while the remaining elements in the column satisfy 

$$l_{ik} = \left( a_{ik} - \sum_{j=1}^{k-1} l_{ij}l_{kj} \right) \Big/ l_{kk}$$

for $k + 1 \le i \le n$. The final element in the L matrix is then given by 

$$l_{nn} = \sqrt{a_{nn} - \sum_{j=1}^{n-1} l_{nj}^2}.$$

EXAMPLE 3.20 Cholesky Decomposition in Action 


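Consider the matrix 

$$A = \begin{bmatrix} 4 & 2 & -1 \\ 2 & 4 & 1 \\ -1 & 1 & 4 \end{bmatrix}.$$

For the first pass through this matrix, we calculate the elements in the first column 
of L as follows: 

$$l_{11} = \sqrt{a_{11}} = \sqrt{4} = 2; \quad l_{21} = \frac{a_{21}}{l_{11}} = 1; \quad\text{and}\quad l_{31} = \frac{a_{31}}{l_{11}} = -\frac{1}{2}.$$

The second and third passes then produce 

$$l_{22} = \sqrt{a_{22} - l_{21}^2} = \sqrt{4 - 1} = \sqrt{3},$$

$$l_{32} = \frac{a_{32} - l_{31}l_{21}}{l_{22}} = \frac{1 + \frac{1}{2}}{\sqrt{3}} = \frac{\sqrt{3}}{2}$$

and 

$$l_{33} = \sqrt{a_{33} - l_{31}^2 - l_{32}^2} = \sqrt{4 - \frac{1}{4} - \frac{3}{4}} = \sqrt{3}.$$

Therefore, 

$$\begin{bmatrix} 4 & 2 & -1 \\ 2 & 4 & 1 \\ -1 & 1 & 4 \end{bmatrix} = \begin{bmatrix} 2 & 0 & 0 \\ 1 & \sqrt{3} & 0 \\ -\frac{1}{2} & \frac{\sqrt{3}}{2} & \sqrt{3} \end{bmatrix} \begin{bmatrix} 2 & 1 & -\frac{1}{2} \\ 0 & \sqrt{3} & \frac{\sqrt{3}}{2} \\ 0 & 0 & \sqrt{3} \end{bmatrix}.$$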
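
The column-by-column passes described for the Cholesky decomposition can be sketched as follows. This is a minimal illustration rather than a production routine: the function name `cholesky`, the nested-list storage, and the choice to raise an error when a nonpositive value appears under the square root are all assumptions. The test matrix is the 3 x 3 matrix whose factor entries (2, 1, -1/2, sqrt(3), sqrt(3)/2, sqrt(3)) match the values computed in Example 3.20.

```python
import math


def cholesky(a):
    """Return the lower triangular L with A = L L^T, computed one column at a time."""
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for k in range(n):
        # Diagonal element of column k.
        s = a[k][k] - sum(l[k][j] ** 2 for j in range(k))
        if s <= 0:
            raise ValueError("matrix is not symmetric positive definite")
        l[k][k] = math.sqrt(s)
        # Remaining elements of column k.
        for i in range(k + 1, n):
            l[i][k] = (a[i][k] - sum(l[i][j] * l[k][j] for j in range(k))) / l[k][k]
    return l


a = [[4, 2, -1], [2, 4, 1], [-1, 1, 4]]
for row in cholesky(a):
    print(row)
```

Only the lower triangle of the input is ever read, which is where the savings over a general LU factorization come from.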


Taking the symmetry of the matrix into account when constructing the 
factorization algorithm provides more than just aesthetic benefits. The Cholesky 
decomposition requires $\frac{1}{3}n^3 + \frac{1}{2}n^2 - \frac{5}{6}n$ arithmetic operations, plus n square roots. 
This is roughly half the number of operations used by Gaussian elimination and 
general direct factorization. It is possible to eliminate the square roots by factoring 
the matrix into the product $LDL^T$, where L is a lower triangular matrix with ones 
along the main diagonal and D is a diagonal matrix. The problem of constructing 
such a factorization algorithm is considered in Exercise 17. 


Tridiagonal Matrices 


The final special class of matrices we will discuss is the tridiagonal matrices. 

Definition. The matrix A is TRIDIAGONAL if $a_{ij} = 0$ whenever |i - j| > 1; 
that is, on the ith row of the matrix, the only elements that may be nonzero 
are $a_{i,i-1}$, $a_{ii}$ and $a_{i,i+1}$. 

Given the structure of a tridiagonal matrix, we will modify the general Crout 
decomposition algorithm and seek lower and upper triangular matrices of the form 

$$L = \begin{bmatrix} l_{11} \\ l_{21} & l_{22} \\ & l_{32} & l_{33} \\ & & \ddots & \ddots \\ & & & l_{n,n-1} & l_{nn} \end{bmatrix} \quad\text{and}\quad U = \begin{bmatrix} 1 & u_{12} \\ & 1 & u_{23} \\ & & 1 & u_{34} \\ & & & \ddots & \ddots \\ & & & & 1 & u_{n-1,n} \\ & & & & & 1 \end{bmatrix}.$$

Working again in passes, with each pass computing the elements in one column 
of L and one row of U, we find that the first pass gives 

$$l_{11} = a_{11}, \quad l_{21} = a_{21} \quad\text{and}\quad u_{12} = a_{12}/l_{11}.$$

For each subsequent pass (that is, for k = 2, 3, 4, ..., n-1), the relevant formulas 
are 

$$l_{kk} = a_{kk} - l_{k,k-1}u_{k-1,k}, \quad l_{k+1,k} = a_{k+1,k} \quad\text{and}\quad u_{k,k+1} = a_{k,k+1}/l_{kk}.$$

The last element along the diagonal of L is given by $l_{nn} = a_{nn} - l_{n,n-1}u_{n-1,n}$. 


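EXAMPLE 3.21 Solving a System with a Tridiagonal Coefficient Matrix 

Consider the system 

$$\begin{bmatrix} 4 & -1 & & \\ 2 & 4 & -1 & \\ & -2 & 4 & -1 \\ & & -2 & 4 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{bmatrix} = \begin{bmatrix} 6 \\ -6 \\ 3 \\ 4 \end{bmatrix}.$$

To solve this system, we start by factoring the coefficient matrix. In the first pass, 
we calculate 

$$l_{11} = a_{11} = 4; \quad l_{21} = a_{21} = 2; \quad u_{12} = \frac{a_{12}}{l_{11}} = -\frac{1}{4}.$$

The second and third passes then calculate 

$$l_{22} = a_{22} - l_{21}u_{12} = \frac{9}{2}; \quad l_{32} = a_{32} = -2; \quad u_{23} = \frac{a_{23}}{l_{22}} = -\frac{2}{9}$$

and 

$$l_{33} = a_{33} - l_{32}u_{23} = \frac{32}{9}; \quad l_{43} = a_{43} = -2; \quad u_{34} = \frac{a_{34}}{l_{33}} = -\frac{9}{32}.$$

The final pass gives 

$$l_{44} = a_{44} - l_{43}u_{34} = \frac{55}{16}.$$

The complete LU decomposition of the coefficient matrix is then 

$$\begin{bmatrix} 4 & & & \\ 2 & \frac{9}{2} & & \\ & -2 & \frac{32}{9} & \\ & & -2 & \frac{55}{16} \end{bmatrix} \begin{bmatrix} 1 & -\frac{1}{4} & & \\ & 1 & -\frac{2}{9} & \\ & & 1 & -\frac{9}{32} \\ & & & 1 \end{bmatrix}.$$

Moving on to the solution step, forward substitution applied to Lz = b yields 

$$z_1 = \frac{b_1}{l_{11}} = \frac{3}{2}; \quad z_2 = \frac{b_2 - l_{21}z_1}{l_{22}} = -2;$$
$$z_3 = \frac{b_3 - l_{32}z_2}{l_{33}} = -\frac{9}{32}; \quad z_4 = \frac{b_4 - l_{43}z_3}{l_{44}} = 1.$$

Backward substitution then produces 

$$x_4 = z_4 = 1; \quad x_3 = z_3 - u_{34}x_4 = 0;$$
$$x_2 = z_2 - u_{23}x_3 = -2; \quad x_1 = z_1 - u_{12}x_2 = 1.$$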


Note that the solution of a linear system with a tridiagonal coefficient matrix 
is very inexpensive. The factor step requires 3n - 3 arithmetic operations, and the 
solve step requires 5n - 4 operations. The total cost is thus only 8n - 7 operations. 
This is a significant reduction from the roughly $\frac{2}{3}n^3$ operations needed to solve 
a linear system with an arbitrary coefficient matrix. Because tridiagonal matrices 
will arise so often in the coming chapters, a complete algorithm for solving a system 
with a tridiagonal coefficient matrix is given in Appendix B. 
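
As an informal companion to the formulas of this subsection (and not a transcription of the Appendix B algorithm), the factor and solve steps might be sketched in Python as follows. The function name `solve_tridiagonal` and the storage of the matrix as three lists (`sub`, `diag`, `sup` for the sub-, main and superdiagonals) are assumptions made for illustration. The data are those of a 4 x 4 system consistent with the factors 4, 9/2, 32/9, 55/16 found in Example 3.21.

```python
def solve_tridiagonal(sub, diag, sup, b):
    """Solve a tridiagonal system by Crout factorization without pivoting.

    sub[i]  = a[i+1][i]   (length n-1)
    diag[i] = a[i][i]     (length n)
    sup[i]  = a[i][i+1]   (length n-1)
    """
    n = len(diag)
    l = [0.0] * n          # diagonal of L; the subdiagonal of L equals `sub`
    u = [0.0] * (n - 1)    # superdiagonal of U; the diagonal of U is all ones
    # Factor step: 3n - 3 arithmetic operations.
    l[0] = diag[0]
    for k in range(n - 1):
        u[k] = sup[k] / l[k]
        l[k + 1] = diag[k + 1] - sub[k] * u[k]
    # Forward substitution on Lz = b.
    z = [0.0] * n
    z[0] = b[0] / l[0]
    for i in range(1, n):
        z[i] = (b[i] - sub[i - 1] * z[i - 1]) / l[i]
    # Backward substitution on Ux = z.
    x = z[:]
    for i in range(n - 2, -1, -1):
        x[i] = z[i] - u[i] * x[i + 1]
    return x


x = solve_tridiagonal(sub=[2, -2, -2], diag=[4, 4, 4, 4], sup=[-1, -1, -1],
                      b=[6, -6, 3, 4])
print(x)  # approximately [1, -2, 0, 1]
```

Storing only the three diagonals keeps both the memory and the work proportional to n.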


Application Problem: Multistage Chemical Extraction 


In the Chapter 3 Overview (see page 139), we developed the system of equations 

$$\begin{aligned} -(W + Sm)x_1 + Sm\,x_2 &= -W x_{in} \\ W x_{i-1} - (W + Sm)x_i + Sm\,x_{i+1} &= 0 \qquad (i = 2, 3, 4, \ldots, n-1) \\ W x_{n-1} - (W + Sm)x_n &= -S y_{in}, \end{aligned}$$

where $x_i$ denotes the mass fraction of a chemical exiting the ith stage of a 
countercurrent extraction reactor in the water stream. The parameters are the flow rate 
in the water stream W, the flow rate in the solvent stream S, the ratio of the mass 
fraction of the chemical in the solvent stream to the mass fraction in the water 
stream m, the mass fraction in the input water stream $x_{in}$, and the mass fraction 
in the input solvent stream $y_{in}$. 

Suppose we are working with a six-stage reactor with a water stream flow rate 
of W = 200 kg/hr and a solvent stream flow rate of S = 50 kg/hr. The input mass 
fractions are $x_{in} = 0.075$ and $y_{in} = 0$, and m = 7. The system of equations for the 
$x_i$ then becomes 

$$\begin{bmatrix} -550 & 350 & & & & \\ 200 & -550 & 350 & & & \\ & 200 & -550 & 350 & & \\ & & 200 & -550 & 350 & \\ & & & 200 & -550 & 350 \\ & & & & 200 & -550 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \\ x_6 \end{bmatrix} = \begin{bmatrix} -15 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \end{bmatrix}.$$

The solution of this system is 

$$x = [\, 0.042205 \;\; 0.023465 \;\; 0.012756 \;\; 0.006637 \;\; 0.003140 \;\; 0.001142 \,]^T.$$

Note that the mass fraction in the water stream as it exits the reactor is only 
0.001142, which is a reduction of nearly 98.5% from the input mass fraction. 
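
The six-stage calculation can be reproduced with a short script. The sketch below assembles the three diagonals directly from the parameters and solves by elimination without pivoting; the function name `extraction_mass_fractions` and its argument order are hypothetical, chosen only for this illustration.

```python
def extraction_mass_fractions(n, W, S, m, x_in, y_in):
    """Water-stream mass fractions x_1, ..., x_n for an n-stage extraction reactor."""
    sub = [W] * (n - 1)              # coefficient of x_{i-1}
    diag = [-(W + S * m)] * n        # coefficient of x_i
    sup = [S * m] * (n - 1)          # coefficient of x_{i+1}
    rhs = [0.0] * n
    rhs[0] = -W * x_in
    rhs[-1] = -S * y_in
    # Forward elimination (no pivoting).
    for i in range(1, n):
        factor = sub[i - 1] / diag[i - 1]
        diag[i] -= factor * sup[i - 1]
        rhs[i] -= factor * rhs[i - 1]
    # Back substitution.
    x = [0.0] * n
    x[-1] = rhs[-1] / diag[-1]
    for i in range(n - 2, -1, -1):
        x[i] = (rhs[i] - sup[i] * x[i + 1]) / diag[i]
    return x


x = extraction_mass_fractions(n=6, W=200, S=50, m=7, x_in=0.075, y_in=0.0)
print([round(value, 6) for value in x])
print(f"reduction: {100 * (1 - x[-1] / 0.075):.1f}%")
```

Changing `y_in` or `n` in the call is enough to explore other reactor configurations.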
EXERCISES 


1. Classify each of the following matrices as strictly diagonally dominant, symmetric 
positive definite, both, or neither. 


2 -1 0 bo 4 
fay | <1 4 8 (by: | 4-6 <1 
0 2 6 aa ae 
ae ee fe, 2c) 
(ll See "0 (dy 32) 8 4 
2 0 6 ood 27 
ge a A He aft? ie ap 
0 43 7-9 i) Bee 1 
() 19 9 6 2 (f) |} 1 9 -2 0 
0 0 01 bit oa 


2. Consider the 2 x 2 symmetric matrix 

$$\begin{bmatrix} a & b \\ b & c \end{bmatrix}.$$

What conditions must the elements a, b, and c satisfy to guarantee that the 
matrix is positive definite? 

3. Consider the matrix 

$$\begin{bmatrix} a & -1 & 0 \\ -1 & 4 & 1 \\ 0 & 1 & 6 \end{bmatrix}.$$

(a) For what values of a will this matrix be positive definite? 
(b) For what values of a will this matrix be strictly diagonally dominant? 

4. Repeat Exercise 3 for the matrix 

$$\begin{bmatrix} 5 & -2 & 2 \\ -2 & 6 & a \\ 2 & a & 7 \end{bmatrix}.$$


5. Consider the matrix 


(a) What conditions must a and b satisfy for this matrix to be symmetric positive 
definite? 


(b) What conditions must a and b satisfy for this matrix to be strictly diagonally 
dominant? 


6. (a) Suppose that A is a strictly diagonally dominant matrix. Show that the 


matrix —A is strictly diagonally dominant but that the matrix A? need not 
be strictly diagonally dominant. 

(b) Suppose that A and B are both strictly diagonally dominant matrices. Show 
that A+ B, A—B, and AB need not be strictly diagonally dominant. 


7. (a) Suppose that A is a symmetric positive definite matrix. Show that the 


matrix —A is not symmetric positive definite but that the matrix AT is 
symmetric positive definite. 

(b) Suppose that A and B are both symmetric positive definite matrices. Show 
that A+ B is symmetric positive definite but that A — B need not be 
symmetric positive definite. 


8. Show that if the matrix A is symmetric positive definite, then A is nonsingular. 

9. Let A be an n x n symmetric positive definite matrix. 

(a) Show that $a_{ii} > 0$ for each i = 1, 2, 3, ..., n. 
(b) Show that $a_{ij}^2 < a_{ii}a_{jj}$ for $i \ne j$. 

10. Compute the Cholesky decomposition for each of the following matrices. 


16 -28 0 9/4 3 3/2 
(a) | -28 53 10 | (b) | 3 25/4 7/2 
0 10 29 3/2 7/2 17/4 
4 429-239, D 1 -2 3 -2 
2 & i 33 ee a: 
() | 2 1 wo 3 (4) | 3 2 nn 5 
0 ~2 3 18 2 8 -5 9 


11. Show that the computation of a Cholesky decomposition for an n x n matrix 
requires $\frac{1}{3}n^3 + \frac{1}{2}n^2 - \frac{5}{6}n$ arithmetic operations plus n square roots. 

12. (a) Construct an algorithm to perform forward and backward substitution on 
the system Ax = b, given a Cholesky decomposition ($A = LL^T$) for the 
coefficient matrix. 

(b) How many arithmetic operations are required by the algorithm from part 
(a)? 

13. Solve each of the following systems by computing a Cholesky decomposition for 
the coefficient matrix and then performing forward and backward substitution 
(see Exercise 12a). 

(a) A = matrix given in Exercise 10a, $b = [\, 8 \;\; -2 \;\; 38 \,]^T$ 
(b) A = matrix given in Exercise 10b, $b = [\, 3 \;\; 1 \;\; 9 \,]^T$ 
(c) A = matrix given in Exercise 10c, $b = [\, 4 \;\; -4 \;\; 4 \;\; -13 \,]^T$ 
(d) A = matrix given in Exercise 10d, $b = [\, 15 \;\; -12 \;\; 56 \;\; -35 \,]^T$ 

14. Solve each of the following systems of equations. Note that each system has a 
tridiagonal coefficient matrix. 


(a) 
32, —- Zz9 = - 2d 
2} + 442 + 23 = 7 
3a2 + 5243 - 2 = 15 
—223 + Tre = 18 
(b) 
22, -— 29 = 0 
—21} + 229 3 = 0 
—%2 + 273 - 24 = O 
—@ + 24 = § 
(c) 
497, — £9 = 3 
—-%, -~ 5x + 623 = Q 
tg — 3273 + 2 = —-4 
zz + 324 = ~-2 


15. Repeat the “Multistage Chemical Extraction” problem with a solvent stream 
input mass fraction of $y_{in} = 0.02$. By what percentage is the mass fraction in 
the water stream reduced? 

16. An absorption column works much like an extraction reactor (see page 139). A 
gas stream with flow rate G and input mass fraction $y_{in}$ of a chemical is used to 
transfer the chemical to a liquid stream that has a flow rate L and an input mass 
fraction $x_{in}$. At equilibrium, it is assumed that $y_i = mx_i$, where $x_i$ and $y_i$ are 
the mass fractions of the chemical within the liquid and gas streams, respectively, 
as they exit the ith stage of the column. 

(a) Set up the system of equations for an n stage absorption column. 

(b) If L = 2500 kg/hr, G = 4000 kg/hr, $x_{in} = 0$, $y_{in} = 0.05$, and m = 1.46, 
what is the mass fraction in the liquid stream as it exits an eight stage 
column? 

17. (a) Construct an algorithm to factor an n x n symmetric positive definite matrix 
into the form $LDL^T$, where L is a lower triangular matrix with ones along its 
diagonal and D is a diagonal matrix. How many arithmetic operations are 
required to compute the $LDL^T$ decomposition? How does this compare with 
the number of operations needed to compute a Cholesky decomposition? 

(b) Construct an algorithm to solve the system Ax = b given an $LDL^T$ 
decomposition of the coefficient matrix. How many arithmetic operations does this 
solve step require? How does this compare with the number of operations 
required by the solve step associated with a Cholesky decomposition? 

18. Repeat Exercise 13 using an $LDL^T$ decomposition rather than a Cholesky 
decomposition. 

19. A matrix A is pentadiagonal if $a_{ij} = 0$ whenever |i - j| > 2. 
(a) Construct an algorithm to efficiently compute the Crout decomposition of a 
pentadiagonal matrix. 
(b) How many operations are required by the algorithm from part (a)? 
(c) How many operations are needed to carry out forward and backward 
substitution using the decomposition obtained from part (a)? 


3.8 ITERATIVE TECHNIQUES FOR LINEAR SYSTEMS: BASIC CONCEPTS AND 
METHODS 


Having just devoted several sections to the development of direct techniques for lin- 
ear systems—techniques that produce an answer in a fixed number of operations—it 
is natural to ask why we would want or even need to develop iterative techniques. 
For systems of small dimension, there is no need. Direct techniques will perform 
very efficiently. However, linear systems arising from practical applications will fre- 
quently be quite large. The coefficient matrices associated with these systems also 
tend to be sparse, meaning that only a small percentage of the entries are nonzero. 
We will encounter systems of this type in Chapter 9 when we treat the solution of 
elliptic partial differential equations. 

For systems with large, sparse coefficient matrices, direct techniques are often 
less efficient than iterative techniques. Even though multiple iterations need to 
be performed to achieve convergence, an iterative solution will typically require 
fewer total operations than a direct solution. It will often happen that the nonzero 
elements in the coefficient matrix will exhibit a well-defined pattern. In these cases, 
an iterative solution will not require the storage of the coefficient matrix at all— 
only the structure of the equations will be needed. As an added bonus, iterative 
techniques are generally insensitive to roundoff error. 


Basic Concepts 


Basic iterative techniques for the solution of linear systems of equations are 
analogous to the fixed-point techniques which were discussed in Chapter 2. The original 
linear system Ax = b, which can be interpreted as the rootfinding problem 

find the n-vector x so that Ax - b = 0, 

is first converted to the fixed point problem 

find the n-vector x so that x = Tx + c, 

for some matrix T and vector c. Next, starting from some initial approximation 
to the solution of the fixed point problem, $x^{(0)}$, a sequence of vectors $\{x^{(k)}\}$ is 
computed according to the rule 

$$x^{(k+1)} = Tx^{(k)} + c. \qquad (1)$$

Within this context, the matrix T is called the iteration matrix. The functional 
iteration is terminated when some appropriate measure of the difference between 
successive vectors in the sequence, $x^{(k)}$ and $x^{(k+1)}$, falls below a user-specified 
tolerance. 

The analysis of the functional iteration scheme given by (1) boils down to 
four important questions. First, what conditions guarantee a unique solution to the 
fixed point problem? Second, under what conditions will the sequence generated 
by (1) converge to this unique fixed point? Third, when the sequence generated 
by (1) converges, how quickly does it converge? Fourth, what conditions must 
the matrix T and the vector c satisfy in order for the fixed point problem to be 
consistent with the original rootfinding problem (i.e., for the two problems to have 
the same solution)? 

The following theorem from general matrix theory plays a major role in es- 
tablishing answers to these questions. 


Theorem. Let A be an n x n matrix. Then the following statements are 
equivalent: 

1. $\rho(A) < 1$, where $\rho(A)$ denotes the spectral radius of A; 

2. $A^k \to 0$ as $k \to \infty$; and 

3. $A^k x \to 0$ as $k \to \infty$ for any vector x. 


A proof of this result can be found in Isaacson and Keller [1]. 
Let’s start with the question of the uniqueness of the solution of the fixed 
point problem. Manipulating the fixed-point equation, we find 

$$x = Tx + c \;\Leftrightarrow\; x - Tx = c \;\Leftrightarrow\; (I - T)x = c.$$

From this last equation, it follows that the fixed point problem has a unique solution 
if and only if the matrix I - T is nonsingular. A sufficient condition for I - T to be 
nonsingular is $\rho(T) < 1$ (see Exercise 10 from Section 3.3). Hence, the fixed point 
problem is guaranteed to have a unique solution whenever $\rho(T) < 1$. 

To establish convergence, let $x^*$ denote the solution to the fixed point problem, 
and define the iteration error vector by $e^{(k)} = x^{(k)} - x^*$. Subtracting the equation 
$x^* = Tx^* + c$ from equation (1) yields the error evolution equation 

$$e^{(k+1)} = Te^{(k)}.$$

Working backward through this equation, we find 

$$\begin{aligned} e^{(k+1)} &= Te^{(k)} \\ &= T(Te^{(k-1)}) = T^2 e^{(k-1)} \\ &= T^2(Te^{(k-2)}) = T^3 e^{(k-2)} \\ &\;\;\vdots \\ &= T^{k+1} e^{(0)}. \end{aligned}$$

Ideally, $e^{(k+1)}$ should approach 0 as $k \to \infty$ for any choice of initial vector $x^{(0)}$, 
that is, for any initial error vector $e^{(0)}$. The theorem stated above indicates that 
this will happen if and only if $\rho(T) < 1$. Hence, the iteration scheme defined by 
equation (1) will converge for any choice of the initial vector $x^{(0)}$ if and only if 
$\rho(T) < 1$. 

From the error evolution equation, we find 

$$\|e^{(k+1)}\| \le \|T\| \, \|e^{(k)}\|$$

for any vector norm $\|\cdot\|$ and associated natural matrix norm. Provided $\|T\| < 1$, 
it can be shown that 

$$\|e^{(k)}\| \le \frac{\|T\|^k}{1 - \|T\|} \, \|x^{(1)} - x^{(0)}\|$$

(see Exercise 14). These two inequalities imply that the sequence $\{x^{(k)}\}$ converges 
linearly with an asymptotic error constant that is less than or equal to $\|T\|$. 
Carrying out a more precise analysis, it can be shown that the asymptotic error constant 
is equal to $\rho(T)$. The proof is based on the fact that $\rho(T)$ is the greatest lower 
bound for all natural matrix norms of T. See Ortega [2] for details. Thus, the 
smaller the spectral radius of the iteration matrix, the faster the convergence of the 
corresponding iterative scheme. 

The final preliminary issue to discuss is that of consistency. In order for the 
iteration defined by (1) to be of any practical use, the solution of the fixed point 
problem, $x^* = (I - T)^{-1}c$, must be identical to the solution of the original linear 
system, $x^* = A^{-1}b$. Hence, when constructing the fixed point problem from the 
linear system, we must be certain that T and c satisfy the relation 

$$(I - T)^{-1}c = A^{-1}b.$$


Splitting Methods 


A broad class of consistent iterative methods, known as splitting methods, can be 
constructed by introducing the notion of a splitting. 


Definition. Let A be a given n x n matrix. If M and N are n x n matrices with 
M nonsingular and A = M - N, then the pair (M, N) is called a SPLITTING 
of the matrix A. 


So let’s suppose that (M, N) forms a splitting of the matrix A. Then 

Ax = b is equivalent to (M - N)x = b. 

Clearing the parentheses and transposing the term involving the matrix N to the 
right-hand side of the equation yields 

$$Mx = Nx + b.$$

Finally, premultiplying by $M^{-1}$ produces 

$$x = M^{-1}Nx + M^{-1}b.$$

Hence the splitting A = M - N determines the fixed point problem x = Tx + c 
and associated iteration scheme $x^{(k+1)} = Tx^{(k)} + c$, where 

$$T = M^{-1}N \quad\text{and}\quad c = M^{-1}b.$$

To establish that splitting methods are always consistent, first note that with 
$T = M^{-1}N$, 

$$\begin{aligned} I - T &= I - M^{-1}N \\ &= M^{-1}(M - N) \\ &= M^{-1}A. \end{aligned}$$

Therefore, $(I - T)^{-1} = A^{-1}M$. Finally, with $c = M^{-1}b$, 

$$(I - T)^{-1}c = A^{-1}M \cdot M^{-1}b = A^{-1}b,$$

as required. Descriptions of the three most commonly used splitting methods are 
presented below. The convergence properties for these three methods will then be 
discussed at the end of the section. 


The Jacobi Method, Gauss-Seidel Method, and SOR Method 
To identify the splittings associated with the Jacobi method, the Gauss-Seidel 
method, and the SOR method, first express the coefficient matrix A in the form 

$$A = D - L - U.$$

Here, D is the diagonal part of A, -L is the strictly lower triangular part of A, and 
-U is the strictly upper triangular part. It is important to keep in mind that the 
matrices L and U used here are in no way related to the LU decomposition of the 
coefficient matrix. As an example, suppose 

$$A = \begin{bmatrix} 5 & 1 & 2 \\ -3 & 9 & 4 \\ 1 & 2 & -7 \end{bmatrix}.$$

Then 

$$D = \begin{bmatrix} 5 & 0 & 0 \\ 0 & 9 & 0 \\ 0 & 0 & -7 \end{bmatrix}, \quad L = \begin{bmatrix} 0 & 0 & 0 \\ 3 & 0 & 0 \\ -1 & -2 & 0 \end{bmatrix} \quad\text{and}\quad U = \begin{bmatrix} 0 & -1 & -2 \\ 0 & 0 & -4 \\ 0 & 0 & 0 \end{bmatrix}.$$
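
Because the sign convention in A = D - L - U is easy to get wrong, a small sketch may help. The function name `split_dlu` is an illustrative choice; it returns the three pieces as nested lists, with the off-diagonal parts negated as the text prescribes.

```python
def split_dlu(a):
    """Split A into D, L, U with A = D - L - U (L, U strictly triangular)."""
    n = len(a)
    d = [[a[i][j] if i == j else 0 for j in range(n)] for i in range(n)]
    l = [[-a[i][j] if i > j else 0 for j in range(n)] for i in range(n)]  # note the sign
    u = [[-a[i][j] if i < j else 0 for j in range(n)] for i in range(n)]  # note the sign
    return d, l, u


a = [[5, 1, 2], [-3, 9, 4], [1, 2, -7]]
d, l, u = split_dlu(a)
print(d)  # [[5, 0, 0], [0, 9, 0], [0, 0, -7]]
print(l)  # [[0, 0, 0], [3, 0, 0], [-1, -2, 0]]
print(u)  # [[0, -1, -2], [0, 0, -4], [0, 0, 0]]
```

Subtracting `l` and `u` from `d` entry by entry recovers the original matrix, which is a convenient sanity check.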


The Jacobi method is based on the splitting M = D and N = L + U. In 
order for M to be nonsingular, it must be the case that, for each i, $d_{ii} = a_{ii} \ne 0$. 
If this relationship does not hold for even a single value of i, then the equations 
in the system must be reordered before the Jacobi method can be applied. With 
the specific choice of splitting indicated above, the iteration scheme for the Jacobi 
method is defined by 

$$x^{(k+1)} = T_{jac}x^{(k)} + c_{jac}, \qquad (2)$$

where 

$$T_{jac} = D^{-1}(L + U) \quad\text{and}\quad c_{jac} = D^{-1}b.$$

Taking into account the structure of the iteration matrix, $T_{jac}$, and the vector $c_{jac}$, 
the individual components of equation (2) can be written as 

$$x_i^{(k+1)} = \frac{1}{a_{ii}} \left[ b_i - \sum_{j=1}^{i-1} a_{ij}x_j^{(k)} - \sum_{j=i+1}^{n} a_{ij}x_j^{(k)} \right]. \qquad (3)$$

Hence, the Jacobi method is equivalent to solving the i-th equation in the system 
for the unknown $x_i$. 

Since, in general, the value of $x_i^{(k)}$ will be needed to compute $x_j^{(k+1)}$ for each 
j = i+1, i+2, ..., n, the value of $x_i^{(k)}$ cannot be overwritten by the newly computed 
value of $x_i^{(k+1)}$. This implies that when implementing the Jacobi method, two 
storage arrays will have to be maintained, one for the old approximation vector 
$x^{(k)}$ and one for the new approximation vector $x^{(k+1)}$. It also follows that the 
components of $x^{(k+1)}$ can be computed in any order and that, on a parallel or 
vector machine, all components of $x^{(k+1)}$ can be computed simultaneously. For this 
reason, the Jacobi method is often called the method of Simultaneous Relaxation. 


EXAMPLE 3.22 The Jacobi Method in Action 


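Consider the system of equations 

$$\begin{aligned} 5x_1 + x_2 + 2x_3 &= 10 \\ -3x_1 + 9x_2 + 4x_3 &= -14 \\ x_1 + 2x_2 - 7x_3 &= -33. \end{aligned}$$

The Jacobi method, when applied to this system, will produce the sequence of 
approximations $\{x^{(k)}\}$ according to the rules 

$$x_1^{(k+1)} = \frac{1}{5}\left[ 10 - x_2^{(k)} - 2x_3^{(k)} \right]$$
$$x_2^{(k+1)} = \frac{1}{9}\left[ -14 + 3x_1^{(k)} - 4x_3^{(k)} \right]$$
$$x_3^{(k+1)} = -\frac{1}{7}\left[ -33 - x_1^{(k)} - 2x_2^{(k)} \right].$$

If we start with $x^{(0)} = [\, 0 \;\; 0 \;\; 0 \,]^T$, then the components of $x^{(1)}$ are 

$$x_1^{(1)} = \frac{1}{5}\left[ 10 - x_2^{(0)} - 2x_3^{(0)} \right] = 2$$
$$x_2^{(1)} = \frac{1}{9}\left[ -14 + 3x_1^{(0)} - 4x_3^{(0)} \right] = -\frac{14}{9}$$
$$x_3^{(1)} = -\frac{1}{7}\left[ -33 - x_1^{(0)} - 2x_2^{(0)} \right] = \frac{33}{7}.$$

The following table summarizes the 14 iterations of the Jacobi method that were 
needed for $\|x^{(k+1)} - x^{(k)}\|_\infty$ to fall below $5 \times 10^{-4}$. Other stopping criteria can be 
imposed, but this is the most common. The exact solution to this problem is 

$$x = [\, 1 \;\; -3 \;\; 4 \,]^T.$$
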

Hence, $\|x - x^{(14)}\|_\infty = 2.43 \times 10^{-4}$. 


 k    x^(k) 

 0    [ 0.000000   0.000000  0.000000 ]^T 
 1    [ 2.000000  -1.555556  4.714286 ]^T 
 2    [ 0.425397  -2.984127  4.555556 ]^T 
 3    [ 0.774603  -3.438448  3.922449 ]^T 
 4    [ 1.118710  -3.040665  3.842530 ]^T 
 5    [ 1.071121  -2.890443  4.005340 ]^T 
 6    [ 0.975953  -2.978666  4.041462 ]^T 
 7    [ 0.979148  -3.026443  4.002660 ]^T 
 8    [ 1.004225  -3.008133  3.989466 ]^T 
 9    [ 1.005840  -2.993910  3.998280 ]^T 
10    [ 0.999470  -2.997289  4.002574 ]^T 
11    [ 0.998428  -3.001321  4.000699 ]^T 
12    [ 0.999985  -3.000835  3.999398 ]^T 
13    [ 1.000408  -2.999738  3.999759 ]^T 
14    [ 1.000044  -2.999757  4.000133 ]^T 
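
The computations of Example 3.22 can be reproduced with a few lines of Python. The sketch below follows equation (3) directly; the function name `jacobi`, the default tolerance, and the iteration cap `max_iter` are illustrative assumptions. Note the two separate lists for the old and new approximations, as required by the discussion preceding the example.

```python
def jacobi(a, b, x0, tol=5e-4, max_iter=100):
    """Jacobi iteration; stops when the max-norm of x_new - x_old drops below tol.

    Returns the final approximation and the number of iterations performed.
    """
    n = len(b)
    x_old = list(x0)
    for k in range(1, max_iter + 1):
        # Every component of x_new is built from x_old only.
        x_new = [
            (b[i] - sum(a[i][j] * x_old[j] for j in range(n) if j != i)) / a[i][i]
            for i in range(n)
        ]
        if max(abs(x_new[i] - x_old[i]) for i in range(n)) < tol:
            return x_new, k
        x_old = x_new
    raise RuntimeError("Jacobi iteration did not converge")


a = [[5, 1, 2], [-3, 9, 4], [1, 2, -7]]
b = [10, -14, -33]
x, iterations = jacobi(a, b, [0.0, 0.0, 0.0])
print(iterations)                       # 14
print([round(value, 6) for value in x])
```

Because each new component depends only on the old vector, the list comprehension could be evaluated in any order, which is the sense in which the method is "simultaneous".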


An obvious improvement that can be made to the Jacobi method is to use the 
value of $x_i^{(k+1)}$ as soon as it has been calculated in the computation of all subsequent 
entries in the vector $x^{(k+1)}$ rather than waiting until the next iteration. After all, 
$x_i^{(k+1)}$ is supposed to be a better approximation to $x_i$ than $x_i^{(k)}$. This modification 
amounts to changing equation (3) to 

$$x_i^{(k+1)} = \frac{1}{a_{ii}} \left[ b_i - \sum_{j=1}^{i-1} a_{ij}x_j^{(k+1)} - \sum_{j=i+1}^{n} a_{ij}x_j^{(k)} \right]. \qquad (4)$$

The only difference between the equations is that the superscript on x in the first 
summation is now k + 1. The iteration scheme corresponding to equation (4) is 
known as the Gauss-Seidel method. Note that the Gauss-Seidel method is not 
vectorizable. The entries in $x^{(k+1)}$ must be computed in succession. Hence, the 
Gauss-Seidel method is also known as Successive Relaxation. 

Working backward from equation (4), we find that the splitting upon which 
the Gauss-Seidel method is based is 

$$M = D - L \quad\text{and}\quad N = U.$$

Thus, the iteration matrix for the Gauss-Seidel method is given by 

$$T_{gs} = (D - L)^{-1}U,$$

and the vector c is given by 

$$c_{gs} = (D - L)^{-1}b.$$

The necessary and sufficient condition for the matrix M to be nonsingular is the 
same as above: for each i, we must have $d_{ii} = a_{ii} \ne 0$. 


EXAMPLE 3.23 The Gauss-Seidel Method in Action 


Reconsider the system of equations 

$$\begin{aligned} 5x_1 + x_2 + 2x_3 &= 10 \\ -3x_1 + 9x_2 + 4x_3 &= -14 \\ x_1 + 2x_2 - 7x_3 &= -33. \end{aligned}$$

The Gauss-Seidel method, when applied to this system, will produce the sequence 
of approximations $\{x^{(k)}\}$ according to the rules 

$$x_1^{(k+1)} = \frac{1}{5}\left[ 10 - x_2^{(k)} - 2x_3^{(k)} \right]$$
$$x_2^{(k+1)} = \frac{1}{9}\left[ -14 + 3x_1^{(k+1)} - 4x_3^{(k)} \right]$$
$$x_3^{(k+1)} = -\frac{1}{7}\left[ -33 - x_1^{(k+1)} - 2x_2^{(k+1)} \right].$$

If we start with $x^{(0)} = [\, 0 \;\; 0 \;\; 0 \,]^T$, then the components of $x^{(1)}$ are 

$$x_1^{(1)} = \frac{1}{5}\left[ 10 - x_2^{(0)} - 2x_3^{(0)} \right] = 2$$
$$x_2^{(1)} = \frac{1}{9}\left[ -14 + 3x_1^{(1)} - 4x_3^{(0)} \right] = -\frac{8}{9}$$
$$x_3^{(1)} = -\frac{1}{7}\left[ -33 - x_1^{(1)} - 2x_2^{(1)} \right] = \frac{299}{63}.$$

The following table summarizes the 10 iterations of the Gauss-Seidel method that 
were needed for $\|x^{(k+1)} - x^{(k)}\|_\infty$ to fall below $5 \times 10^{-4}$. Recall that the exact 
solution to this problem is 

$$x = [\, 1 \;\; -3 \;\; 4 \,]^T.$$


Hence, |x — x0 ||,, = 7.8 x 1075, Note that convergence is obtained with the 
Gauss-Seidel method in roughly 30% fewer iterations than the Jacobi method. Fur- 
ther, the error in the final Gauss-Seidel approximation is roughly one-third the error 
in the final Jacobi approximation. 


k xh) 
0 — [ 0.000000 0.000000 0.000000 }” 
1 [ 2.000000 -~0.888889 4.746032 
2  [ 0.279365 —3.571781 3.733686 | 
3 [| 1.220882 2.808011 4.086409 
4 [ 0.927039 -3.062724 3.971656 | 
5 | 1.023883 -2.979442 4.009286 
6 
7 
8 
9 


a3 3 9 3 8 4 4 44 


[ 0.992174 ~3.006736 3.996958 | 
| 1.002564 -2.997793 4.000997 
[ 0.999160 —3.000723 3.999673 |] 


1.000275 —2.999763 4.000107 
10 [ 0.999910 ~3.000078 3.999965 | 
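
The iteration counts quoted for this system can be checked with a short script.
The following is only a minimal sketch in plain Python: the function names
(`jacobi_step`, `gauss_seidel_step`, `iterate`) are illustrative choices, not
part of the text, and the stopping rule is the max-norm test on successive
iterates used in the examples.

```python
def jacobi_step(A, b, x):
    """One Jacobi sweep: every entry uses only the previous iterate."""
    n = len(b)
    return [(b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
            for i in range(n)]


def gauss_seidel_step(A, b, x):
    """One Gauss-Seidel sweep, equation (4): new entries are used at once."""
    n = len(b)
    x = list(x)
    for i in range(n):
        # x[j] already holds the (k+1) value for j < i and the k value for j > i
        s = sum(A[i][j] * x[j] for j in range(n) if j != i)
        x[i] = (b[i] - s) / A[i][i]
    return x


def iterate(step, A, b, tol=5e-4, max_iter=100):
    """Apply `step` from x = 0 until the max-norm change drops below tol."""
    x = [0.0] * len(b)
    for k in range(1, max_iter + 1):
        x_new = step(A, b, x)
        change = max(abs(p - q) for p, q in zip(x_new, x))
        x = x_new
        if change < tol:
            return x, k
    raise RuntimeError("no convergence within max_iter sweeps")


A = [[5.0, 1.0, 2.0], [-3.0, 9.0, 4.0], [1.0, 2.0, -7.0]]
b = [10.0, -14.0, -33.0]
x_jac, k_jac = iterate(jacobi_step, A, b)
x_gs, k_gs = iterate(gauss_seidel_step, A, b)
print("Jacobi sweeps:", k_jac, " Gauss-Seidel sweeps:", k_gs)
```

The only difference between the two step functions is whether the sweep reads
from a frozen copy of the previous iterate or from the vector being updated.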


The final iterative technique that we will discuss in this section is the SOR
method. An explanation for the name of the method will be provided shortly. This
technique attempts to improve upon the convergence of the Gauss-Seidel method
by computing $x_i^{(k+1)}$ as a weighted average of $x_i^{(k)}$ and the value
produced by the Gauss-Seidel method, as given in equation (4). Let the weighting
parameter, also known as a relaxation parameter, be denoted by $\omega$. Then
the analogue of equations (3) and (4) for the SOR method is

$$x_i^{(k+1)} = (1 - \omega) x_i^{(k)} + \frac{\omega}{a_{ii}} \left[ b_i - \sum_{j=1}^{i-1} a_{ij} x_j^{(k+1)} - \sum_{j=i+1}^{n} a_{ij} x_j^{(k)} \right]. \qquad (5)$$

Note that when $\omega = 1$, the SOR method reduces to the Gauss-Seidel method.
Typically, there exists a range of $\omega$ values for which the SOR method will
converge faster than the Gauss-Seidel method. The splitting associated with the
SOR method is

$$M = \frac{1}{\omega} D - L \quad \text{and} \quad N = \left( \frac{1}{\omega} - 1 \right) D + U.$$

Therefore,

$$T_{sor} = \left( \frac{1}{\omega} D - L \right)^{-1} \left[ \left( \frac{1}{\omega} - 1 \right) D + U \right]$$

and

$$\mathbf{c}_{sor} = \left( \frac{1}{\omega} D - L \right)^{-1} \mathbf{b}.$$


EXAMPLE 3.24 The SOR Method in Action


One more time we will consider the system of equations


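$$\begin{aligned}
5x_1 + x_2 + 2x_3 &= 10 \\
-3x_1 + 9x_2 + 4x_3 &= -14 \\
x_1 + 2x_2 - 7x_3 &= -33.
\end{aligned}$$

With $\omega = 0.9$, the iteration equations for the SOR method become

$$\begin{aligned}
x_1^{(k+1)} &= 0.1 x_1^{(k)} + \frac{0.9}{5}\left[10 - x_2^{(k)} - 2x_3^{(k)}\right] \\
x_2^{(k+1)} &= 0.1 x_2^{(k)} + \frac{0.9}{9}\left[-14 + 3x_1^{(k+1)} - 4x_3^{(k)}\right] \\
x_3^{(k+1)} &= 0.1 x_3^{(k)} - \frac{0.9}{7}\left[-33 - x_1^{(k+1)} - 2x_2^{(k+1)}\right].
\end{aligned}$$

If we start with $\mathbf{x}^{(0)} = [\, 0 \;\; 0 \;\; 0 \,]^T$, then the
components of $\mathbf{x}^{(1)}$ are

$$\begin{aligned}
x_1^{(1)} &= 0.1 x_1^{(0)} + \frac{0.9}{5}\left[10 - x_2^{(0)} - 2x_3^{(0)}\right] = 1.8 \\
x_2^{(1)} &= 0.1 x_2^{(0)} + \frac{0.9}{9}\left[-14 + 3x_1^{(1)} - 4x_3^{(0)}\right] = -0.86 \\
x_3^{(1)} &= 0.1 x_3^{(0)} - \frac{0.9}{7}\left[-33 - x_1^{(1)} - 2x_2^{(1)}\right] = 4.253143.
\end{aligned}$$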


The table shown below summarizes the 6 iterations of the SOR method that were
needed for $\|\mathbf{x}^{(k+1)} - \mathbf{x}^{(k)}\|_\infty$ to fall below
$5 \times 10^{-4}$. Note that
$\|\mathbf{x} - \mathbf{x}^{(6)}\|_\infty = 6.0 \times 10^{-5}$.

 k    $\mathbf{x}^{(k)}$

 0    [ 0.000000   0.000000   0.000000 ]^T
 1    [ 1.800000  -0.860000   4.253143 ]^T
 2    [ 0.603669  -3.006157   3.972774 ]^T
 3    [ 0.971276  -2.998342   3.994011 ]^T
 4    [ 0.998985  -2.997743   3.999851 ]^T
 5    [ 0.999546  -2.999851   3.999965 ]^T
 6    [ 0.999940  -2.999989   3.999992 ]^T


[Figure: number of iterations to achieve convergence (vertical axis) versus
relaxation parameter $\omega$ (horizontal axis).]

The selection of the parameter $\omega$ is crucial to the performance of the SOR
method. The figure above displays the number of iterations required for the SOR
method to converge to within a tolerance of $5 \times 10^{-4}$ as a function of
$\omega$. The horizontal lines indicate the number of iterations required by the
Gauss-Seidel method (bottom line) and the Jacobi method (top line). We see that
there is a range of values, roughly $0.65 < \omega < 1.0$, for which the
performance of the SOR method is as good as or better than the performance of
the Gauss-Seidel method. As expected, the SOR method outperforms the Jacobi
method over a broader range of $\omega$ values.


For the example problem we have been examining in this section, the SOR
method performed better than the Gauss-Seidel method primarily for $\omega < 1$.
However, for many of the practical problems to which the SOR method is applied
(such as the systems of equations associated with the solution of elliptic
partial differential equations, which we will encounter in Chapter 9),
performance is better than the Gauss-Seidel method for $\omega > 1$. When
$\omega$ is selected greater than 1, the iterative method is referred to as an
overrelaxation scheme. Since the Gauss-Seidel method is successive relaxation,
this new technique is successive overrelaxation, or the SOR method for short.
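
The dependence of the iteration count on the relaxation parameter can be
explored with a few lines of code. What follows is a sketch only: `sor_solve` is
a hypothetical helper name, the update is equation (5) applied entry by entry,
and the stopping test is the max-norm change between successive iterates with
the tolerance used in this section.

```python
def sor_solve(A, b, omega, tol=5e-4, max_iter=500):
    """SOR iteration, equation (5), started from the zero vector.

    Returns the final iterate and the number of sweeps performed.
    """
    n = len(b)
    x = [0.0] * n
    for k in range(1, max_iter + 1):
        change = 0.0
        for i in range(n):
            # entries j < i already hold their (k+1) values
            s = sum(A[i][j] * x[j] for j in range(n) if j != i)
            gauss_seidel_value = (b[i] - s) / A[i][i]
            new_value = (1.0 - omega) * x[i] + omega * gauss_seidel_value
            change = max(change, abs(new_value - x[i]))
            x[i] = new_value
        if change < tol:
            return x, k
    raise RuntimeError("no convergence within max_iter sweeps")


A = [[5.0, 1.0, 2.0], [-3.0, 9.0, 4.0], [1.0, 2.0, -7.0]]
b = [10.0, -14.0, -33.0]
x_09, k_09 = sor_solve(A, b, 0.9)   # relaxation parameter of Example 3.24
x_10, k_10 = sor_solve(A, b, 1.0)   # omega = 1 is the Gauss-Seidel method
print("omega = 0.9:", k_09, "sweeps;  omega = 1.0:", k_10, "sweeps")
```

Calling `sor_solve` in a loop over a grid of $\omega$ values and recording the
second return value gives the data for a plot of iterations against $\omega$.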


Specific Convergence Properties for the Jacobi, Gauss-Seidel, and SOR 
Methods 


We know that the general iterative method
$\mathbf{x}^{(k+1)} = T\mathbf{x}^{(k)} + \mathbf{c}$ converges if and only if
$\rho(T) < 1$. Are there conditions that, when imposed upon the coefficient
matrix $A$, will guarantee that the Jacobi, Gauss-Seidel, and SOR methods
converge? The answer to this question is yes; unfortunately, there is no general
theory, just a collection of special cases. For example, it is known that strict
diagonal dominance of $A$ is sufficient to guarantee that both the Jacobi method
and the Gauss-Seidel method will converge for any choice of the initial vector
$\mathbf{x}^{(0)}$. The proof for the Jacobi method is considered in
Exercise 15, while the proof for the Gauss-Seidel method can be found in
Ortega [2].
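
Strict diagonal dominance is easy to test row by row: each diagonal entry must
exceed, in absolute value, the sum of the absolute values of the other entries
in its row. A minimal sketch follows; the function name is an illustrative
choice, not something defined in the text.

```python
def is_strictly_diagonally_dominant(A):
    """Return True if |a_ii| > sum of |a_ij| over j != i for every row i."""
    return all(
        abs(row[i]) > sum(abs(v) for j, v in enumerate(row) if j != i)
        for i, row in enumerate(A)
    )


# Coefficient matrix of the example system used throughout this section.
A_example = [[5, 1, 2], [-3, 9, 4], [1, 2, -7]]
print(is_strictly_diagonally_dominant(A_example))
```

For the example matrix the three row tests are $5 > 3$, $9 > 7$, and $7 > 3$,
which is consistent with the convergence observed for both methods.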

The following results regarding the Gauss-Seidel method are also useful in 
practice. 


Theorem. 


1. If $A$ is real and symmetric with all positive diagonal elements, then the
Gauss-Seidel method converges if and only if $A$ is positive definite.

2. If $A$ is positive definite, then the Gauss-Seidel method will converge for
any choice of the initial vector $\mathbf{x}^{(0)}$.


A proof of the first part of this theorem can be found in Isaacson and Keller [1].
Consult Ortega [2] or Ralston and Rabinowitz [3] for a proof of the second part.
For the SOR method, the most important convergence results are as follows:


Theorem. 


1. If $A$ has all nonzero diagonal elements, then
$\rho(T_{sor}) \geq |\omega - 1|$. Therefore, the SOR method can converge only
if $0 < \omega < 2$.


2. If $A$ is positive definite and $0 < \omega < 2$, then the SOR method will
converge for any choice of the initial vector $\mathbf{x}^{(0)}$.


See Ortega [2], Young [4], or Golub and Ortega [5] for details. 

What about the speed of convergence? For the sample problem treated in 
this section, it was found that the Gauss-Seidel method converged in fewer itera- 
tions than the Jacobi method. Will this relative performance hold in general? The 
answer to this question is no. There are, in fact, coefficient matrices for which 
the Jacobi method will converge, but the Gauss-Seidel method will not—see Exer- 
cise 13. Furthermore, there is no general theory to indicate which method, Jacobi or 
Gauss-Seidel, will perform best on an arbitrary problem, just a collection of special 
cases. For example (see Ralston and Rabinowitz [3] or Young [4] for a proof):


Theorem. Suppose $A$ is an $n \times n$ matrix. If $a_{ii} > 0$ for each $i$ and
$a_{ij} \leq 0$ whenever $i \neq j$, then one and only one of the following
statements holds:


1. $0 < \rho(T_{gs}) < \rho(T_{jac}) < 1$;
2. $1 < \rho(T_{jac}) < \rho(T_{gs})$;
3. $\rho(T_{jac}) = \rho(T_{gs}) = 0$;
4. $\rho(T_{jac}) = \rho(T_{gs}) = 1$.


Thus, under the hypotheses of this theorem, when one method converges, so will
the other method, with the Gauss-Seidel method converging faster. On the other
hand, when one method diverges, so will the other, with the Gauss-Seidel method
diverging faster.

The final issue we will address is that of the choice of the relaxation parameter
for the SOR method. In practice, one of the most important special cases is the
following.


Theorem. If $A$ is positive definite and tridiagonal, then
$\rho(T_{gs}) = [\rho(T_{jac})]^2 < 1$, and the optimal choice of the relaxation
parameter, $\omega$, for the SOR method is

$$\omega = \frac{2}{1 + \sqrt{1 - [\rho(T_{jac})]^2}}.$$

With this choice of $\omega$, $\rho(T_{sor}) = \omega - 1$.


Ortega [2] provides a proof of this result. There are more general conditions
under which the formula for the optimal value of $\omega$ given in the above
theorem holds, but these require more advanced matrix theory concepts than we
have developed here. For details consult Stoer and Bulirsch [6], Young [4], or
Varga [7]. The optimal choice of $\omega$ for an arbitrary linear system remains
an open question. Methods for computing the optimal value of $\omega$ during the
iterative process are discussed by Hageman and Young [8].
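
The optimal-parameter formula for positive definite tridiagonal matrices is a
one-line computation once $\rho(T_{jac})$ is known. The sketch below is
illustrative only: `optimal_omega` is a hypothetical helper, and the test matrix
(4 on the diagonal, $-1$ on the two adjacent diagonals, order 3) is an assumed
example that does not appear in the text. For that matrix the Jacobi iteration
matrix has eigenvalues $0$ and $\pm\sqrt{2}/4$, so $\rho(T_{jac}) = \sqrt{2}/4$.

```python
import math


def optimal_omega(rho_jacobi):
    """Optimal SOR parameter from the spectral radius of the Jacobi matrix.

    Valid under the hypotheses of the theorem above (A positive definite and
    tridiagonal), which guarantee 0 <= rho_jacobi < 1.
    """
    if not 0.0 <= rho_jacobi < 1.0:
        raise ValueError("formula requires 0 <= rho(T_jac) < 1")
    return 2.0 / (1.0 + math.sqrt(1.0 - rho_jacobi ** 2))


rho_jac = math.sqrt(2.0) / 4.0        # assumed 3 x 3 example, see lead-in
rho_gs = rho_jac ** 2                 # theorem: rho(T_gs) = rho(T_jac)^2
omega = optimal_omega(rho_jac)
rho_sor = omega - 1.0                 # theorem: rho(T_sor) = omega - 1
print(rho_gs, omega, rho_sor)
```

For this small example the optimal $\omega$ is only slightly larger than 1,
because $\rho(T_{jac})$ is far from 1; the gain from overrelaxation grows as
$\rho(T_{jac})$ approaches 1.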
EXERCISES


In Exercises 1-4,
(a) Compute $T_{jac}$ and $T_{gs}$ for the given matrix.
(b) Determine the spectral radius of each iteration matrix from part (a).
(c) Will the Jacobi method converge for any choice of initial vector
$\mathbf{x}^{(0)}$? Will the Gauss-Seidel method converge for any choice of
initial vector $\mathbf{x}^{(0)}$? Explain.

1. $\begin{bmatrix} 2 & -1 \\ -1 & 3 \end{bmatrix}$
Te, “2 
"[3 4] 
3. $\begin{bmatrix} 4 & -1 & -2 \\ -1 & 3 & 0 \\ 0 & -1 & 3 \end{bmatrix}$

4. $\begin{bmatrix} 3 & 2 & -2 \\ -2 & -2 & 1 \\ 5 & -5 & 4 \end{bmatrix}$


5. For each of the following coefficient matrices and right-hand side vectors,
write out the components of the Jacobi method iteration equation. Then, starting
with the initial vector $\mathbf{x}^{(0)} = \mathbf{0}$, perform two iterations
of the Jacobi method.


og eat 
| A 2]. [3 | 

0 2 6 5 

3: 2 a 4 
| 2 -6 3 | [| 

a6) oi: 2200 7 


4 -1 0 0 0 
2 4 -1 0 2 
(d) Q -2 4 -1 |’ -3 
0 0 -2 4 1 


6. Repeat Exercise 5 for the Gauss-Seidel method. 


In Exercises 7-10, use both the Jacobi method and the Gauss-Seidel method to solve
the indicated linear system of equations. Take $\mathbf{x}^{(0)} = \mathbf{0}$,
and terminate iteration when
$\|\mathbf{x}^{(k+1)} - \mathbf{x}^{(k)}\|_\infty$ falls below
$5 \times 10^{-6}$. Record the number of iterations required to achieve
convergence.


Te 
4x1 
Ty 
Ty 
2) 
8. 
9. 
TZ, = 
—32, + 


4 
+ 
+ 


tz + 2% 
822 + 223 
2%, — 523 

+ 2x3 
— 22 
+ 422 — 
= 0a 

+ 23 
+ 323 - 

—z3 «+ 


-5 
23 
9 
4 


It Il 
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10. 


11. 


12. 


13. 


14. 


15, 


16. 
$$\begin{aligned}
4x_1 - x_2 - x_4 &= -1 \\
-x_1 + 4x_2 - x_3 - x_5 &= 0 \\
-x_2 + 4x_3 - x_6 &= 1 \\
-x_1 + 4x_4 - x_5 &= -2 \\
-x_2 - x_4 + 4x_5 - x_6 &= 1 \\
-x_3 - x_5 + 4x_6 &= 2
\end{aligned}$$


The linear systems in Exercises 8 and 9 have positive definite, tridiagonal
coefficient matrices. Determine the optimal value of the relaxation parameter
for the SOR method for each system. Using the corresponding optimal value of
$\omega$, solve the systems from Exercises 8 and 9. Take
$\mathbf{x}^{(0)} = \mathbf{0}$, and terminate iteration when
$\|\mathbf{x}^{(k+1)} - \mathbf{x}^{(k)}\|_\infty$ falls below
$5 \times 10^{-6}$.


For each of the linear systems in Exercises 7-10, generate a plot of the number
of iterations required by the SOR method to achieve convergence as a function
of the relaxation parameter $\omega$. Take $\mathbf{x}^{(0)} = \mathbf{0}$, and
terminate iteration when $\|\mathbf{x}^{(k+1)} - \mathbf{x}^{(k)}\|_\infty$
falls below $5 \times 10^{-5}$. Over roughly what range of $\omega$ values
does the SOR method outperform the Gauss-Seidel method? The Jacobi method?


Let 
2 4 -4 
3.3 #3 |. 


(a) Write out the iteration matrix $T_{jac}$ corresponding to the matrix $A$,
and determine $\rho(T_{jac})$. Will the Jacobi method converge for any choice
of the initial vector $\mathbf{x}^{(0)}$?

(b) Write out the iteration matrix $T_{gs}$ corresponding to the matrix $A$, and
determine $\rho(T_{gs})$. Will the Gauss-Seidel method converge for any choice
of the initial vector $\mathbf{x}^{(0)}$?

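Consider the iteration scheme
$\mathbf{x}^{(k+1)} = T\mathbf{x}^{(k)} + \mathbf{c}$, and suppose that
$\|T\| < 1$ for some natural matrix norm. Show that for any
$\mathbf{x}^{(0)} \in \mathbb{R}^n$,

$$\|\mathbf{x} - \mathbf{x}^{(k)}\| \leq \frac{\|T\|^k}{1 - \|T\|} \, \|\mathbf{x}^{(1)} - \mathbf{x}^{(0)}\|.$$

(Hint: Review the proof of the fixed point iteration convergence theorem in
Section 2.3.)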

Let $A$ be a strictly diagonally dominant matrix and let $T_{jac}$ be the Jacobi
method iteration matrix associated with $A$. Show that $\rho(T_{jac}) < 1$.
(Hint: Show that $\|T_{jac}\|_\infty < 1$ and use the fact that the spectral
radius of a matrix is no larger than any natural matrix norm of that matrix.)


Suppose that p(T’) < 1. Show that 


Gat Se 
17. The variables of interest in an absorption column are the steady-state
composition of solute in the liquid, $x_i$, on each plate and the steady-state
composition of the solute in the gas, $y_i$, on each plate. Suppose we have a
six plate absorption column, where the inlet compositions, $x_0 = 0.05$ kg
solute/kg liquid and $y_7 = 0.3$ kg solute/kg inert gas, are known, as are the
liquid and gas flow rates, $L = 40.8$ kg/min and $G = 66.7$ kg/min. Further, we
will assume that the linear equilibrium relationship $y_i = 0.72 x_i$ holds.
Performing a material balance around an arbitrary plate, we find that the $x_i$
satisfy the system

$$\begin{bmatrix}
-88.824 & 48.024 & 0 & 0 & 0 & 0 \\
40.8 & -88.824 & 48.024 & 0 & 0 & 0 \\
0 & 40.8 & -88.824 & 48.024 & 0 & 0 \\
0 & 0 & 40.8 & -88.824 & 48.024 & 0 \\
0 & 0 & 0 & 40.8 & -88.824 & 48.024 \\
0 & 0 & 0 & 0 & 40.8 & -88.824
\end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \\ x_6 \end{bmatrix}
=
\begin{bmatrix} -2.04 \\ 0 \\ 0 \\ 0 \\ 0 \\ -20.01 \end{bmatrix}.$$

Determine the $x_i$ using

(a) the Jacobi method;

(b) the Gauss-Seidel method; and

(c) the SOR method with $\omega$ ranging from 1.1 through 1.9 in increments
of 0.1.


3.9 ITERATIVE TECHNIQUES FOR LINEAR SYSTEMS: CONJUGATE GRADIENT 
METHOD 


Not all iterative methods for the solution of linear systems are based on the
notion of a splitting and the conversion to a fixed point problem. For example,
there is a completely separate class of techniques based on the equivalence
between the solution of a linear system and the minimization of an associated
quadratic functional. The conjugate gradient method, which is a popular choice
for the solution of large sparse systems, belongs to this latter class of
techniques. For an overview of the many other methods which belong to this class
of minimization methods, consult Ueberhuber [1].


Minimizing a Quadratic Functional and Solving a Linear System 


Suppose that $A$ is a symmetric and positive definite $n \times n$ matrix. This
assumption will be made throughout the section. Consider
$f : \mathbb{R}^n \to \mathbb{R}$ defined by

$$f(\mathbf{x}) = \frac{1}{2} \mathbf{x}^T A \mathbf{x} - \mathbf{b}^T \mathbf{x} + c.$$

A mathematical object such as this, which operates on the components of a vector
to produce a scalar output, is referred to as a functional. In particular, $f$
happens to be a quadratic functional: $\mathbf{x}^T A \mathbf{x}$ consists of a
collection of terms, all of which are of degree two in the components of
$\mathbf{x}$. Under the assumption that $A$ is positive definite, $f$ behaves
much like an upward opening parabola and hence has a unique global minimizer;
that is, there exists a unique vector $\mathbf{x}^*$ such that
$f(\mathbf{x}^*) < f(\mathbf{x})$ for all $\mathbf{x} \in \mathbb{R}^n$,
$\mathbf{x} \neq \mathbf{x}^*$.

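From basic calculus, we know that the location of the minimum value of $f$
can be determined by requiring the gradient of $f$,

$$\nabla f = \begin{bmatrix} \dfrac{\partial f}{\partial x_1} & \dfrac{\partial f}{\partial x_2} & \cdots & \dfrac{\partial f}{\partial x_n} \end{bmatrix}^T,$$

to be identically zero when evaluated at $\mathbf{x} = \mathbf{x}^*$. In terms
of the elements of $A$ and the components of $\mathbf{x}$,

$$f(\mathbf{x}) = \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} a_{ij} x_i x_j - \sum_{j=1}^{n} b_j x_j + c.$$

Therefore, for each $k$,

$$\begin{aligned}
\frac{\partial f}{\partial x_k} &= \frac{1}{2} \left( \sum_{i=1}^{n} x_i a_{ik} + \sum_{j=1}^{n} a_{kj} x_j \right) - b_k \\
&= \frac{1}{2} \left( \sum_{i=1}^{n} a_{ki} x_i + \sum_{j=1}^{n} a_{kj} x_j \right) - b_k
= \sum_{j=1}^{n} a_{kj} x_j - b_k,
\end{aligned}$$

where the symmetry of $A$ was used in going from the first line to the second.
Thus,

$$\nabla f = A\mathbf{x} - \mathbf{b},$$

and locating the minimum of $f$ is equivalent to solving
$A\mathbf{x} = \mathbf{b}$.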
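
The identity $\nabla f = A\mathbf{x} - \mathbf{b}$ can be spot-checked
numerically by comparing it with a central-difference approximation of the
partial derivatives. The sketch below uses an assumed $2 \times 2$ symmetric
positive definite matrix and right-hand side that are not taken from the text;
the function names are likewise illustrative.

```python
def f(A, b, c, x):
    """Quadratic functional f(x) = (1/2) x^T A x - b^T x + c."""
    n = len(x)
    quad = sum(x[i] * A[i][j] * x[j] for i in range(n) for j in range(n))
    return 0.5 * quad - sum(b[i] * x[i] for i in range(n)) + c


def gradient(A, b, x):
    """Analytic gradient A x - b (valid when A is symmetric)."""
    n = len(x)
    return [sum(A[k][j] * x[j] for j in range(n)) - b[k] for k in range(n)]


def central_difference_gradient(A, b, c, x, h=1e-4):
    """Approximate each partial derivative by a central difference."""
    grad = []
    for k in range(len(x)):
        x_plus = list(x)
        x_minus = list(x)
        x_plus[k] += h
        x_minus[k] -= h
        grad.append((f(A, b, c, x_plus) - f(A, b, c, x_minus)) / (2.0 * h))
    return grad


A = [[4.0, 1.0], [1.0, 3.0]]     # assumed symmetric positive definite matrix
b = [1.0, 2.0]
c = 0.0
x = [0.5, -1.0]
exact = gradient(A, b, x)
approx = central_difference_gradient(A, b, c, x)
x_star = [1.0 / 11.0, 7.0 / 11.0]   # solution of A x = b for this example
print(exact, approx, gradient(A, b, x_star))
```

Because $f$ is quadratic, the central difference agrees with the analytic
gradient up to roundoff, and the gradient vanishes at the solution of
$A\mathbf{x} = \mathbf{b}$.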

So how do we numerically approximate the global minimizer of the functional $f$?
Most of the techniques developed for this problem start with an initial guess,
$\mathbf{x}^{(0)}$, and then generate the sequence $\{\mathbf{x}^{(m)}\}$
according to the rule

$$\mathbf{x}^{(m+1)} = \mathbf{x}^{(m)} + \lambda_m \mathbf{d}^{(m)}.$$

The vector $\mathbf{d}^{(m)}$ is called the search direction, and the scalar
$\lambda_m$ is the step size. Of course, different minimization methods are
determined by different choices of the step size and the search direction. We
will now turn our attention to a discussion of the step sizes and search
directions which define the conjugate gradient method.

Choosing the Step Size


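Suppose that the approximate minimizer, $\mathbf{x}^{(m)}$, and the search
direction, $\mathbf{d}^{(m)}$, are known. We will apply the following basic
principle to determine the corresponding step size:

    select $\lambda_m$ so that $f(\mathbf{x})$ is minimized along the line
    $\mathbf{x} = \mathbf{x}^{(m)} + \lambda \mathbf{d}^{(m)}$.

Substituting $\mathbf{x} = \mathbf{x}^{(m)} + \lambda \mathbf{d}^{(m)}$ into
$f(\mathbf{x})$ gives

$$\begin{aligned}
f(\mathbf{x}^{(m)} + \lambda \mathbf{d}^{(m)})
&= \frac{1}{2} \left( \mathbf{x}^{(m)} + \lambda \mathbf{d}^{(m)} \right)^T A \left( \mathbf{x}^{(m)} + \lambda \mathbf{d}^{(m)} \right) - \mathbf{b}^T \left( \mathbf{x}^{(m)} + \lambda \mathbf{d}^{(m)} \right) + c \\
&= \frac{1}{2} {\mathbf{x}^{(m)}}^T A \mathbf{x}^{(m)} + \frac{1}{2} \lambda {\mathbf{x}^{(m)}}^T A \mathbf{d}^{(m)} + \frac{1}{2} \lambda {\mathbf{d}^{(m)}}^T A \mathbf{x}^{(m)} + \frac{1}{2} \lambda^2 {\mathbf{d}^{(m)}}^T A \mathbf{d}^{(m)} \\
&\qquad - \mathbf{b}^T \mathbf{x}^{(m)} - \lambda \mathbf{b}^T \mathbf{d}^{(m)} + c \\
&= \left( \frac{1}{2} {\mathbf{d}^{(m)}}^T A \mathbf{d}^{(m)} \right) \lambda^2 + \left( \frac{1}{2} {\mathbf{d}^{(m)}}^T A \mathbf{x}^{(m)} + \frac{1}{2} {\mathbf{x}^{(m)}}^T A \mathbf{d}^{(m)} - \mathbf{b}^T \mathbf{d}^{(m)} \right) \lambda \\
&\qquad + \frac{1}{2} {\mathbf{x}^{(m)}}^T A \mathbf{x}^{(m)} - \mathbf{b}^T \mathbf{x}^{(m)} + c.
\end{aligned}$$

Since the matrix $A$ is symmetric and, for any two vectors $\mathbf{u}$ and
$\mathbf{v}$, $\mathbf{u}^T \mathbf{v} = \mathbf{v}^T \mathbf{u}$, it follows
that

$${\mathbf{x}^{(m)}}^T A \mathbf{d}^{(m)} = {\mathbf{x}^{(m)}}^T A^T \mathbf{d}^{(m)} = \left( A \mathbf{x}^{(m)} \right)^T \mathbf{d}^{(m)} = {\mathbf{d}^{(m)}}^T A \mathbf{x}^{(m)}$$

and $\mathbf{b}^T \mathbf{d}^{(m)} = {\mathbf{d}^{(m)}}^T \mathbf{b}$.
Therefore,

$$\begin{aligned}
f(\mathbf{x}^{(m)} + \lambda \mathbf{d}^{(m)})
&= \left( \frac{1}{2} {\mathbf{d}^{(m)}}^T A \mathbf{d}^{(m)} \right) \lambda^2 + \left( {\mathbf{d}^{(m)}}^T A \mathbf{x}^{(m)} - {\mathbf{d}^{(m)}}^T \mathbf{b} \right) \lambda + \hat{c} \\
&= \left( \frac{1}{2} {\mathbf{d}^{(m)}}^T A \mathbf{d}^{(m)} \right) \lambda^2 + \left( {\mathbf{d}^{(m)}}^T \mathbf{r}^{(m)} \right) \lambda + \hat{c},
\end{aligned}$$

where $\hat{c} = \frac{1}{2} {\mathbf{x}^{(m)}}^T A \mathbf{x}^{(m)} - \mathbf{b}^T \mathbf{x}^{(m)} + c$
and $\mathbf{r}^{(m)} = A\mathbf{x}^{(m)} - \mathbf{b}$. Note that
$\mathbf{r}^{(m)}$ is just the residual associated with the approximation
$\mathbf{x}^{(m)}$. The positive definiteness of $A$ guarantees that
${\mathbf{d}^{(m)}}^T A \mathbf{d}^{(m)} > 0$ for any
$\mathbf{d}^{(m)} \neq \mathbf{0}$, which implies that
$f(\mathbf{x}^{(m)} + \lambda \mathbf{d}^{(m)})$ is an upward opening parabola
in $\lambda$. $f(\mathbf{x}^{(m)} + \lambda \mathbf{d}^{(m)})$ then achieves its
minimum value when

$$\frac{\partial f}{\partial \lambda} = \left( {\mathbf{d}^{(m)}}^T A \mathbf{d}^{(m)} \right) \lambda + {\mathbf{d}^{(m)}}^T \mathbf{r}^{(m)} = 0.$$

Hence,

$$\lambda_m = -\frac{{\mathbf{d}^{(m)}}^T \mathbf{r}^{(m)}}{{\mathbf{d}^{(m)}}^T A \mathbf{d}^{(m)}}.$$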
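
The step-size formula amounts to an exact line search along the current
direction. As a small sketch, the code below evaluates the formula for an
assumed $2 \times 2$ symmetric positive definite system (not one from the text)
and confirms that moving slightly away from the computed $\lambda$ in either
direction increases $f$. All function names are illustrative.

```python
def dot(u, v):
    """Inner product u^T v."""
    return sum(p * q for p, q in zip(u, v))


def matvec(A, v):
    """Matrix-vector product A v."""
    return [dot(row, v) for row in A]


def step_size(A, b, x, d):
    """Exact line-search step: lambda = -(d^T r) / (d^T A d), with r = A x - b."""
    r = [p - q for p, q in zip(matvec(A, x), b)]
    return -dot(d, r) / dot(d, matvec(A, d))


def f_on_line(A, b, x, d, t):
    """Value of f(x + t d), taking the constant c to be zero."""
    y = [p + t * q for p, q in zip(x, d)]
    return 0.5 * dot(y, matvec(A, y)) - dot(b, y)


A = [[4.0, 1.0], [1.0, 3.0]]   # assumed symmetric positive definite matrix
b = [1.0, 2.0]
x = [0.0, 0.0]
d = [1.0, 2.0]                 # here d = -r, the negative residual at x
lam = step_size(A, b, x, d)
print(lam, f_on_line(A, b, x, d, lam))
```

For this data $\mathbf{d}^T \mathbf{r} = -5$ and
$\mathbf{d}^T A \mathbf{d} = 20$, so the step size is $1/4$.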
Choosing the Search Direction


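As the name of the method implies, the search directions in the conjugate
gradient method are based on the gradient of $f$. Recall it was established
above that

$$\nabla f = A\mathbf{x} - \mathbf{b};$$

thus

$$\nabla f(\mathbf{x}^{(m)}) = A\mathbf{x}^{(m)} - \mathbf{b} = \mathbf{r}^{(m)},$$

the residual vector. To simplify the notation in the derivation that follows, we
will therefore use $\mathbf{r}^{(m)}$ in place of $\nabla f(\mathbf{x}^{(m)})$
when referencing the gradient.

For the first iteration of the conjugate gradient method, the search direction
is chosen as $\mathbf{d}^{(0)} = -\mathbf{r}^{(0)}$. That is, we start from
$\mathbf{x}^{(0)}$ and travel in the direction opposite to that of the gradient;
standard arguments from multivariable calculus guarantee that this is the
direction of maximum decrease in the value of $f$. All subsequent search
directions are determined by

$$\mathbf{d}^{(m+1)} = -\mathbf{r}^{(m+1)} + \alpha_m \mathbf{d}^{(m)}, \qquad m \geq 0.$$

The scalar $\alpha_m$ is chosen so that the search direction
$\mathbf{d}^{(m+1)}$ is $A$-conjugate to the direction $\mathbf{d}^{(m)}$.

Definition. Let $A$ be a symmetric and positive definite matrix. Two vectors,
$\mathbf{u}$ and $\mathbf{v}$, are A-CONJUGATE if

$$\mathbf{u}^T A \mathbf{v} = 0.$$

Thus, to determine the formula for $\alpha_m$, we start by computing

$$\begin{aligned}
{\mathbf{d}^{(m+1)}}^T A \mathbf{d}^{(m)} &= \left( -\mathbf{r}^{(m+1)} + \alpha_m \mathbf{d}^{(m)} \right)^T A \mathbf{d}^{(m)} \\
&= -{\mathbf{r}^{(m+1)}}^T A \mathbf{d}^{(m)} + \alpha_m {\mathbf{d}^{(m)}}^T A \mathbf{d}^{(m)}.
\end{aligned}$$

For this last expression to be zero, we see that

$$\alpha_m = \frac{{\mathbf{r}^{(m+1)}}^T A \mathbf{d}^{(m)}}{{\mathbf{d}^{(m)}}^T A \mathbf{d}^{(m)}}.$$

The matrix-vector product $A\mathbf{d}^{(m)}$ and the vector-vector product
${\mathbf{d}^{(m)}}^T (A\mathbf{d}^{(m)})$ would already have been computed as
part of the calculation of the step size $\lambda_m$, so the calculation of
$\alpha_m$ requires only one additional vector-vector product and one division.

Gathering together the approximate solution update equation, the formula for
the step size and the information regarding the construction of the search
directions, we arrive at the following basic algorithm for the conjugate
gradient method:

    $\mathbf{r}^{(0)} = A\mathbf{x}^{(0)} - \mathbf{b}$
    $\mathbf{d}^{(0)} = -\mathbf{r}^{(0)}$
    for $m = 0, 1, 2, \ldots$
        $\lambda_m = -{\mathbf{d}^{(m)}}^T \mathbf{r}^{(m)} / {\mathbf{d}^{(m)}}^T A \mathbf{d}^{(m)}$
        $\mathbf{x}^{(m+1)} = \mathbf{x}^{(m)} + \lambda_m \mathbf{d}^{(m)}$
        $\mathbf{r}^{(m+1)} = A\mathbf{x}^{(m+1)} - \mathbf{b}$
        if $\sqrt{{\mathbf{r}^{(m+1)}}^T \mathbf{r}^{(m+1)}} <$ TOL, OUTPUT $\mathbf{x}^{(m+1)}$
        $\alpha_m = {\mathbf{r}^{(m+1)}}^T A \mathbf{d}^{(m)} / {\mathbf{d}^{(m)}}^T A \mathbf{d}^{(m)}$
        $\mathbf{d}^{(m+1)} = -\mathbf{r}^{(m+1)} + \alpha_m \mathbf{d}^{(m)}$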


The inputs to this routine are the coefficient matrix $A$, the right-hand-side
vector $\mathbf{b}$, the initial vector $\mathbf{x}^{(0)}$ and the convergence
tolerance TOL. The first two lines set the initial search direction. Inside the
iteration loop, the step size, the updated approximate solution, and the new
residual are computed, in that order. If the length of the new residual is not
below the convergence tolerance, the new search direction is computed in
preparation for the next iteration.


Making the Algorithm More Efficient 


The pseudocode that was presented above for the conjugate gradient method is
not as efficient as it could be. Notice that each iteration requires the
calculation of 2 matrix-vector products, 4 vector-vector products, and 3 vector
additions. By constructing a formula that allows the new residual
$\mathbf{r}^{(m+1)}$ to be calculated from the old residual $\mathbf{r}^{(m)}$
and by taking advantage of the special properties of the residual vectors and
search directions, the operation count for the conjugate gradient method can be
reduced to 1 matrix-vector product, 2 vector-vector products, and 3 vector
additions.

Let’s see how $\mathbf{r}^{(m+1)}$ can be calculated from $\mathbf{r}^{(m)}$.
Take the approximate solution update equation,
$\mathbf{x}^{(m+1)} = \mathbf{x}^{(m)} + \lambda_m \mathbf{d}^{(m)}$, multiply
through by $A$, and then subtract $\mathbf{b}$ from both sides. This produces
the equation

$$A\mathbf{x}^{(m+1)} - \mathbf{b} = A\mathbf{x}^{(m)} - \mathbf{b} + \lambda_m A \mathbf{d}^{(m)}.$$

Applying the definition of the residual vector then yields

$$\mathbf{r}^{(m+1)} = \mathbf{r}^{(m)} + \lambda_m A \mathbf{d}^{(m)}. \qquad (1)$$

Equation (1) still requires a matrix-vector product, $A\mathbf{d}^{(m)}$, but
this is the same matrix-vector product that is needed for the calculation of
both $\lambda_m$ and $\alpha_m$. Using (1) instead of
$A\mathbf{x}^{(m+1)} - \mathbf{b}$ to calculate $\mathbf{r}^{(m+1)}$ will
therefore reduce the number of matrix-vector products performed each iteration
by one.

Further efficiencies can be gained by developing different formulas for
$\lambda_m$ and $\alpha_m$. To derive these formulas, two special relationships
between the residuals and the search directions are needed. First, take
equation (1) and premultiply by ${\mathbf{d}^{(m)}}^T$. This gives

$$\begin{aligned}
{\mathbf{d}^{(m)}}^T \mathbf{r}^{(m+1)} &= {\mathbf{d}^{(m)}}^T \mathbf{r}^{(m)} + \lambda_m {\mathbf{d}^{(m)}}^T A \mathbf{d}^{(m)} \\
&= {\mathbf{d}^{(m)}}^T \mathbf{r}^{(m)} - \frac{{\mathbf{d}^{(m)}}^T \mathbf{r}^{(m)}}{{\mathbf{d}^{(m)}}^T A \mathbf{d}^{(m)}} \, {\mathbf{d}^{(m)}}^T A \mathbf{d}^{(m)} \\
&= {\mathbf{d}^{(m)}}^T \mathbf{r}^{(m)} - {\mathbf{d}^{(m)}}^T \mathbf{r}^{(m)} = 0,
\end{aligned}$$

so the previous search direction and the new residual vector are orthogonal.
With $\alpha_m$ chosen to make the search directions $A$-conjugate, it turns out
that the residual vectors are also orthogonal to one another; that is,

$${\mathbf{r}^{(m)}}^T \mathbf{r}^{(l)} = 0 \quad \text{for } m \neq l.$$

For a proof of this result, see Golub and van Loan [2].


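Solving equation (1) for $A\mathbf{d}^{(m)}$ gives

$$A\mathbf{d}^{(m)} = \frac{1}{\lambda_m} \left[ \mathbf{r}^{(m+1)} - \mathbf{r}^{(m)} \right]. \qquad (2)$$

Premultiplying (2) by ${\mathbf{d}^{(m)}}^T$ then yields

$$\begin{aligned}
{\mathbf{d}^{(m)}}^T A \mathbf{d}^{(m)} &= \frac{1}{\lambda_m} {\mathbf{d}^{(m)}}^T \left[ \mathbf{r}^{(m+1)} - \mathbf{r}^{(m)} \right] \\
&= -\frac{1}{\lambda_m} {\mathbf{d}^{(m)}}^T \mathbf{r}^{(m)} \\
&= -\frac{1}{\lambda_m} \left[ -\mathbf{r}^{(m)} + \alpha_{m-1} \mathbf{d}^{(m-1)} \right]^T \mathbf{r}^{(m)} \\
&= \frac{1}{\lambda_m} {\mathbf{r}^{(m)}}^T \mathbf{r}^{(m)}. \qquad (3)
\end{aligned}$$

In going from the first line to the second and from the third line to the
fourth, the orthogonality of the previous search direction and the new residual
vector has been used. Equation (3) gives the new formula for $\lambda_m$:

$$\lambda_m = \frac{{\mathbf{r}^{(m)}}^T \mathbf{r}^{(m)}}{{\mathbf{d}^{(m)}}^T A \mathbf{d}^{(m)}}.$$

The vector-vector product ${\mathbf{r}^{(m)}}^T \mathbf{r}^{(m)}$ will have been
computed for the convergence test during the previous iteration, so the new
formula for $\lambda_m$ saves one vector-vector product over the original
formula.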

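Let’s now return to equation (2) and premultiply by ${\mathbf{r}^{(m+1)}}^T$.
The result of this operation is

$$\begin{aligned}
{\mathbf{r}^{(m+1)}}^T A \mathbf{d}^{(m)} &= \frac{1}{\lambda_m} {\mathbf{r}^{(m+1)}}^T \left[ \mathbf{r}^{(m+1)} - \mathbf{r}^{(m)} \right] \\
&= \frac{1}{\lambda_m} {\mathbf{r}^{(m+1)}}^T \mathbf{r}^{(m+1)}, \qquad (4)
\end{aligned}$$

where the term ${\mathbf{r}^{(m+1)}}^T \mathbf{r}^{(m)}$ has been eliminated by
applying the orthogonality of the residual vectors. Dividing (4) by (3) produces

$$\alpha_m = \frac{{\mathbf{r}^{(m+1)}}^T \mathbf{r}^{(m+1)}}{{\mathbf{r}^{(m)}}^T \mathbf{r}^{(m)}}.$$

This formula reduces the operation count for each iteration of the conjugate
gradient method by one more vector-vector product.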

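Taking into account the new formulas for $\mathbf{r}^{(m+1)}$, $\lambda_m$, and
$\alpha_m$, we arrive at an efficient algorithm for the conjugate gradient
method:

    $\mathbf{r}^{(0)} = A\mathbf{x}^{(0)} - \mathbf{b}$
    $\mathbf{d}^{(0)} = -\mathbf{r}^{(0)}$
    set $\delta^{(0)} = {\mathbf{r}^{(0)}}^T \mathbf{r}^{(0)}$
    for $m = 0, 1, 2, \ldots$
        set $\mathbf{u} = A\mathbf{d}^{(m)}$
        $\lambda_m = \delta^{(m)} / {\mathbf{d}^{(m)}}^T \mathbf{u}$
        $\mathbf{x}^{(m+1)} = \mathbf{x}^{(m)} + \lambda_m \mathbf{d}^{(m)}$
        $\mathbf{r}^{(m+1)} = \mathbf{r}^{(m)} + \lambda_m \mathbf{u}$
        set $\delta^{(m+1)} = {\mathbf{r}^{(m+1)}}^T \mathbf{r}^{(m+1)}$
        if $\sqrt{\delta^{(m+1)}} <$ TOL, OUTPUT $\mathbf{x}^{(m+1)}$
        $\alpha_m = \delta^{(m+1)} / \delta^{(m)}$
        $\mathbf{d}^{(m+1)} = -\mathbf{r}^{(m+1)} + \alpha_m \mathbf{d}^{(m)}$

This algorithm requires only one matrix-vector product, two vector-vector
products, and three vector additions per iteration.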


EXAMPLE 3.25 Conjugate Gradient Method in Action 


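Consider the linear system

$$\begin{bmatrix}
1 & -1/4 & -1/4 & 0 \\
-1/4 & 1 & 0 & -1/4 \\
-1/4 & 0 & 1 & -1/4 \\
0 & -1/4 & -1/4 & 1
\end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{bmatrix}
=
\begin{bmatrix} -1 \\ 0 \\ 1 \\ -1 \end{bmatrix}.$$

Let’s take $\mathbf{x}^{(0)} = [\, 0 \;\; 0 \;\; 0 \;\; 0 \,]^T$. Then

$$\mathbf{r}^{(0)} = A\mathbf{x}^{(0)} - \mathbf{b} = -\mathbf{b} = [\, 1 \;\; 0 \;\; -1 \;\; 1 \,]^T$$

and

$$\mathbf{d}^{(0)} = -\mathbf{r}^{(0)} = [\, -1 \;\; 0 \;\; 1 \;\; -1 \,]^T.$$

We have three preliminary calculations to make before determining the step
size $\lambda_0$:

$\delta^{(0)} = {\mathbf{r}^{(0)}}^T \mathbf{r}^{(0)} = 3$;

$\mathbf{u} = A\mathbf{d}^{(0)} = [\, -5/4 \;\; 1/2 \;\; 3/2 \;\; -5/4 \,]^T$; and

${\mathbf{d}^{(0)}}^T \mathbf{u} = 4$.

Therefore,

$$\lambda_0 = \frac{\delta^{(0)}}{{\mathbf{d}^{(0)}}^T \mathbf{u}} = \frac{3}{4}$$

and

$$\mathbf{x}^{(1)} = \mathbf{x}^{(0)} + \lambda_0 \mathbf{d}^{(0)} = [\, -3/4 \;\; 0 \;\; 3/4 \;\; -3/4 \,]^T.$$

The residual associated with this new approximate solution is given by

$$\mathbf{r}^{(1)} = \mathbf{r}^{(0)} + \lambda_0 \mathbf{u}
= \begin{bmatrix} 1 \\ 0 \\ -1 \\ 1 \end{bmatrix}
+ \frac{3}{4} \begin{bmatrix} -5/4 \\ 1/2 \\ 3/2 \\ -5/4 \end{bmatrix}
= \begin{bmatrix} 1/16 \\ 3/8 \\ 1/8 \\ 1/16 \end{bmatrix},$$

and $\delta^{(1)} = {\mathbf{r}^{(1)}}^T \mathbf{r}^{(1)} = 21/128$. Then

$$\alpha_0 = \frac{\delta^{(1)}}{\delta^{(0)}} = \frac{7}{128}$$

and

$$\mathbf{d}^{(1)} = -\mathbf{r}^{(1)} + \alpha_0 \mathbf{d}^{(0)}
= \begin{bmatrix} -1/16 \\ -3/8 \\ -1/8 \\ -1/16 \end{bmatrix}
+ \frac{7}{128} \begin{bmatrix} -1 \\ 0 \\ 1 \\ -1 \end{bmatrix}
= \begin{bmatrix} -15/128 \\ -3/8 \\ -9/128 \\ -15/128 \end{bmatrix}.$$

This completes the first iteration.

To start the second iteration, only two preliminary calculations are needed to
determine the step size $\lambda_1$:

$\mathbf{u} = A\mathbf{d}^{(1)} = [\, -3/512 \;\; -81/256 \;\; -3/256 \;\; -3/512 \,]^T$; and

${\mathbf{d}^{(1)}}^T \mathbf{u} = 495/4096$.

Recall that the value $\delta^{(1)} = {\mathbf{r}^{(1)}}^T \mathbf{r}^{(1)} = 21/128$
is available from the first iteration. Hence,

$$\lambda_1 = \frac{\delta^{(1)}}{{\mathbf{d}^{(1)}}^T \mathbf{u}} = \frac{224}{165},$$

$$\mathbf{x}^{(2)} = \mathbf{x}^{(1)} + \lambda_1 \mathbf{d}^{(1)}
= \begin{bmatrix} -3/4 \\ 0 \\ 3/4 \\ -3/4 \end{bmatrix}
+ \frac{224}{165} \begin{bmatrix} -15/128 \\ -3/8 \\ -9/128 \\ -15/128 \end{bmatrix}
= \begin{bmatrix} -10/11 \\ -28/55 \\ 36/55 \\ -10/11 \end{bmatrix},$$

and

$$\mathbf{r}^{(2)} = \mathbf{r}^{(1)} + \lambda_1 \mathbf{u} = [\, 3/55 \;\; -3/55 \;\; 6/55 \;\; 3/55 \,]^T.$$

To complete the second iteration we compute

$$\delta^{(2)} = {\mathbf{r}^{(2)}}^T \mathbf{r}^{(2)} = \frac{63}{3025},$$

$$\alpha_1 = \frac{\delta^{(2)}}{\delta^{(1)}} = \frac{384}{3025},$$

and

$$\mathbf{d}^{(2)} = -\mathbf{r}^{(2)} + \alpha_1 \mathbf{d}^{(1)} = [\, -42/605 \;\; 21/3025 \;\; -357/3025 \;\; -42/605 \,]^T.$$

For the third iteration, we make the following calculations:

$\mathbf{u} = A\mathbf{d}^{(2)} = [\, -126/3025 \;\; 126/3025 \;\; -252/3025 \;\; -126/3025 \,]^T$;

${\mathbf{d}^{(2)}}^T \mathbf{u} = 2646/166375$;

$\lambda_2 = \dfrac{\delta^{(2)}}{{\mathbf{d}^{(2)}}^T \mathbf{u}} = \dfrac{55}{42}$;

$\mathbf{x}^{(3)} = \mathbf{x}^{(2)} + \lambda_2 \mathbf{d}^{(2)} = [\, -1 \;\; -1/2 \;\; 1/2 \;\; -1 \,]^T$; and

$\mathbf{r}^{(3)} = \mathbf{r}^{(2)} + \lambda_2 \mathbf{u} = [\, 0 \;\; 0 \;\; 0 \;\; 0 \,]^T$.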

Thus, three iterations of the conjugate gradient method have produced the
exact solution. Of course, all of our calculations were performed in exact
arithmetic. What if we perform the calculations in finite-precision arithmetic?
The following table displays the output from an implementation of the conjugate
gradient method in MATLAB.


 k    $\mathbf{x}^{(k)}$

 0    [  0.000000   0.000000   0.000000   0.000000 ]^T
 1    [ -0.750000   0.000000   0.750000  -0.750000 ]^T
 2    [ -0.909091  -0.509091   0.654545  -0.909091 ]^T
 3    [ -1.000000  -0.500000   0.500000  -1.000000 ]^T


The entries in $\mathbf{x}^{(3)}$ are actually correct to sixteen decimal places.
For comparison, to achieve similar accuracy, the Jacobi method requires 55
iterations, the Gauss-Seidel method 29 iterations and the SOR method (with
$\omega = 1.075$) 17 iterations.
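
The exact-arithmetic computation in this example can be reproduced with
rational numbers. The sketch below follows the efficient algorithm step by step
using Python's `fractions.Fraction`; the function names are illustrative, the
convergence test compares $\delta^{(m+1)}$ with TOL$^2$ to avoid a square root,
and the matrix entries are an assumption: they are our reading of the example's
coefficient matrix, chosen because they reproduce the vectors
$\mathbf{u} = A\mathbf{d}^{(m)}$ quoted above.

```python
from fractions import Fraction


def dot(u, v):
    """Inner product u^T v."""
    return sum(p * q for p, q in zip(u, v))


def matvec(A, v):
    """Matrix-vector product A v."""
    return [dot(row, v) for row in A]


def conjugate_gradient(A, b, x0, tol, max_iter=None):
    """Efficient conjugate gradient iteration for symmetric positive definite A.

    Returns the approximate solution and the number of iterations performed.
    """
    x = list(x0)
    r = [p - q for p, q in zip(matvec(A, x), b)]    # r = A x - b
    d = [-p for p in r]
    delta = dot(r, r)
    if delta < tol * tol:                           # x0 already good enough
        return x, 0
    if max_iter is None:
        max_iter = 10 * len(b)
    for m in range(max_iter):
        u = matvec(A, d)                            # the one matrix-vector product
        lam = delta / dot(d, u)
        x = [p + lam * q for p, q in zip(x, d)]
        r = [p + lam * q for p, q in zip(r, u)]     # equation (1)
        delta_new = dot(r, r)
        if delta_new < tol * tol:                   # same as sqrt(delta) < tol
            return x, m + 1
        alpha = delta_new / delta
        d = [-p + alpha * q for p, q in zip(r, d)]
        delta = delta_new
    raise RuntimeError("no convergence within max_iter iterations")


q = Fraction(1, 4)
A = [[1, -q, -q, 0], [-q, 1, 0, -q], [-q, 0, 1, -q], [0, -q, -q, 1]]
b = [Fraction(-1), Fraction(0), Fraction(1), Fraction(-1)]
x0 = [Fraction(0)] * 4
x, iterations = conjugate_gradient(A, b, x0, tol=Fraction(1, 10**8))
print(iterations, x)
```

Replacing the `Fraction` entries with floats runs the same routine in
finite-precision arithmetic.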


Some Final Comments 


We observed in the example presented above that the conjugate gradient method
produced the exact solution to the linear system in only three iterations. This
was not the result of a “lucky” choice of the coefficient matrix and the
right-hand-side vector. When working in exact arithmetic, the conjugate gradient
method will always produce the exact solution to an $n \times n$ system in at
most $n$ iterations. To see why this happens, suppose, without loss of
generality, that $\mathbf{x}^{(0)} = \mathbf{0}$. With this choice for the
initial vector, it follows that

$$\mathbf{x}^{(m)} = \sum_{k=1}^{m} c_k \mathbf{d}^{(k-1)}$$

for some scalars $c_1, c_2, c_3, \ldots, c_m$. By construction, the
$A$-conjugate search directions are also linearly independent (see Exercise 11),
so the set

$$\left\{ \mathbf{d}^{(0)}, \mathbf{d}^{(1)}, \mathbf{d}^{(2)}, \ldots, \mathbf{d}^{(n-1)} \right\}$$

forms a basis for $\mathbb{R}^n$. The exact solution of the linear system will
therefore lie in

$$\operatorname{span} \left\{ \mathbf{d}^{(0)}, \mathbf{d}^{(1)}, \ldots, \mathbf{d}^{(m-1)} \right\}$$

for some $m \leq n$, and the iteration will terminate.

In finite precision arithmetic, roundoff error will generally prevent the
direction vectors from remaining conjugate and the residual vectors from
remaining orthogonal. The conjugate gradient method will still converge since
each iteration is guaranteed to reduce the value of $f$, but convergence can be
slow, particularly when the coefficient matrix is nearly singular. To handle
this situation, it is common to modify, or precondition, the linear system prior
to applying the conjugate gradient method. Preconditioning is accomplished by
premultiplying the original system by the inverse of a nonsingular matrix, $M$,
known as the preconditioning matrix. This multiplication converts
$A\mathbf{x} = \mathbf{b}$ into $A^P \mathbf{x}^P = \mathbf{b}^P$, where
$A^P = M^{-1} A M$, $\mathbf{x}^P = M^{-1} \mathbf{x}$ and
$\mathbf{b}^P = M^{-1} \mathbf{b}$. Determining an appropriate preconditioning
matrix $M$ is a nontrivial task, but a good choice for $M$ can result in
convergence after $O(\sqrt{n})$ or fewer iterations. For a detailed discussion
of preconditioning of the conjugate gradient method, consult Golub and
van Loan [2], Ueberhuber [1], or Khosla and Rubin [3].

What if the coefficient matrix is nonsymmetric or not positive definite or
both? A vast array of variants on the conjugate gradient method have been
developed to handle these more general linear systems. For example, the
generalized minimal residual (GMRES) method (Saad and Schultz [4]) will converge
in at most $n$ iterations for an $n \times n$ system, but the computational cost
and storage requirements increase linearly with the iteration count. The
biconjugate gradient (BiCG) method (Golub and van Loan [2]) replaces the
orthogonal sequence of residuals by two mutually orthogonal sequences but no
longer provides a minimization of the residuals, requires two matrix-vector
products per iteration, often experiences irregular convergence behavior, and
may even break down. The quasi-minimal residual (QMR) method (Freund and
Nachtigal [5], [6]) converges about as rapidly as the GMRES method and typically
exhibits convergence behavior that is much smoother than the BiCG method. The
squared conjugate gradient (CGS) method (Sonneveld [7]) applies the basic
operator of the BiCG method twice, so converges roughly twice as fast, but the
convergence behavior is often erratic. The stabilized biconjugate gradient
(BiCGSTAB) method (van der Vorst [8]) can be interpreted as a combination of the
BiCG method with a repeatedly applied GMRES method. This technique generally
converges about as fast as the CGS method, but without the erratic convergence
behavior. The two survey articles, Golub and O’Leary [9] and Broyden [10], would
be good starting points for further study.




References 


1. C. Ueberhuber, Numerical Computation 2: Methods, Software and Analysis,
Springer-Verlag, Berlin, 1997.

2. G. Golub and C. van Loan, Matrix Computations, 3rd edition, Johns Hopkins
Press, Baltimore, 1996.

3. P. K. Khosla and S. G. Rubin, “A Conjugate Gradient Iterative Method,”
Computers and Fluids, 9, 109-121, 1981.

4. Y. Saad and M. Schultz, “GMRES: A Generalized Minimal Residual Algorithm
for Solving Nonsymmetric Linear Systems,” SIAM Journal on Scientific and
Statistical Computing, 7, 856-869, 1986.

5. R. Freund and N. Nachtigal, “QMR: A Quasi-Minimal Residual Method for
Non-Hermitian Linear Systems,” Numerische Mathematik, 60, 315-339, 1991.

6. R. Freund and N. Nachtigal, “An Implementation of the QMR Method Based
on Two Coupled Two-Term Recurrences,” Technical Report 92.15, RIACS, NASA
Ames, 1992.

7. P. Sonneveld, “CGS: A Fast Lanczos-type Solver for Nonsymmetric Linear
Systems,” SIAM Journal on Scientific and Statistical Computing, 10, 36-52, 1989.

8. H. A. van der Vorst, “Bi-CGSTAB: A Fast and Smoothly Converging Variant
of Bi-CG for the Solution of Nonsymmetric Linear Systems,” SIAM Journal on
Scientific and Statistical Computing, 13, 631-644, 1992.

9. G. Golub and D. O’Leary, “Some History of the Conjugate Gradient and Lanczos
Methods,” SIAM Review, 31, 50-102, 1989.

10. C. G. Broyden, “A New Taxonomy of Conjugate Gradient Methods,” Computers
and Mathematics with Applications, 31, 7-17, 1996.


EXERCISES 


In Exercises 1-4, solve the indicated linear system using the conjugate gradient method 
in exact arithmetic. Show that the exact solution is obtained in each case in three or 
fewer iterations. 


1. $$\begin{aligned}
3x_1 - x_2 + 2x_3 &= -6 \\
-x_1 + 3x_2 + x_3 &= 3 \\
2x_1 + x_2 + 3x_3 &= -4
\end{aligned}$$

2. $$\begin{aligned}
4x_1 - x_2 &= 2 \\
-x_1 + 4x_2 - x_3 &= 4 \\
-x_2 + 4x_3 &= 10
\end{aligned}$$

3. $$\begin{aligned}
6x_1 - 2x_2 + 3x_3 &= 11 \\
-2x_1 + 8x_2 + x_3 &= -9 \\
3x_1 + x_2 + 7x_3 &= 9
\end{aligned}$$

4. $$\begin{aligned}
3x_1 + x_2 - x_3 &= 2 \\
x_1 + 4x_2 + 2x_3 &= 7 \\
-x_1 + 2x_2 + 5x_3 &= 6
\end{aligned}$$

In Exercises 5-10, use the conjugate gradient method to solve the indicated
linear system of equations. Take $\mathbf{x}^{(0)} = \mathbf{0}$, and use a
convergence tolerance of $5 \times 10^{-7}$. Compare the number of iterations
required to achieve convergence with the number of iterations required by the
Jacobi method and the Gauss-Seidel method using the same starting vector and
convergence tolerance. For Exercises 7 and 8, also determine the number of
iterations required by the SOR method. The optimal values of the relaxation
parameter for Exercises 7 and 8 are $\omega = 1.0923$ and $\omega = 1.1128$,
respectively.


5. 


4, + 2 + os - 84 = 8 
zg. + 822 + 243 + 3rq = —12 
a + 2 + Bag ~- 24g = 18 
—2, + 349 ~- 23 + 444 = —20 
6. 
321 - - £5 = 3 
4z2 + 23 + 225 = 7 
—Zy Ze + 523 x6 = 6 
624 — 8 —- 2a6 = Il 
—2£y — aq + Tas + 226 = 1 
229 + 43 — 244 + 245 + 846 = TF 
7, 
7z, — 3xe = 4 
—37, + Sno + 223 = -6 
zm + 323 - £4 = 3 
—-2Z3 + 1024 + 425 = 7 
4tg + S25 = 2 
8. 
4m, — £2 — £a = -l 
—2, + 472 - 3&3 — 5 = 
—- «2 + 423 — aw = 1 
-Zy + 4%4 -— 25 = -2 
- 29 —- a@q¢ + 4%, - xe = L 
—- £3 — #5 + 4% = 2 
9. 
40) + a2 + 93 + a = 38/2 
a. + 3@2 - #3 + a = 1/2 
t1 o— 2 + 2x3 = 17/2 
zy + 2 + 324 = 27/2 
10. 
10x, + zo + 223 + Ste + 4e5 = 12 
a + 922 ~ 23 + 2m - 35 = —27 
Qn, - #2 + Tr3 + Seq - Sts = 14 
3) + Qao + 3x3 + deg - 23 = I? 
4c, — 3a. - 5&5 —- T4 + liv, = 12 
11. Let $A$ be an $n \times n$ symmetric and positive definite matrix and suppose that the 
nonzero vectors $\mathbf{v}_1, \mathbf{v}_2, \mathbf{v}_3, \ldots, \mathbf{v}_n$ form an $A$-conjugate set, that is, $\mathbf{v}_i^T A \mathbf{v}_j = 0$ 
whenever $i \ne j$. Show that 

$$c_1\mathbf{v}_1 + c_2\mathbf{v}_2 + c_3\mathbf{v}_3 + \cdots + c_n\mathbf{v}_n = \mathbf{0}$$

requires that $c_1 = c_2 = c_3 = \cdots = c_n = 0$. Hence the set 
$\{\mathbf{v}_1, \mathbf{v}_2, \mathbf{v}_3, \ldots, \mathbf{v}_n\}$ 
is linearly independent and forms a basis for $\mathbb{R}^n$. 
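12. A simpler choice for the search direction would be to set $\mathbf{d}^{(m)} = -\mathbf{r}^{(m)}$. This 
amounts to always selecting the direction in which $f$ decreases most rapidly in the 
vicinity of $\mathbf{x}^{(m)}$ and produces what is known as the method of steepest descent. 
The resulting algorithm is summarized in the following pseudocode. 

    $\mathbf{r}^{(0)} = A\mathbf{x}^{(0)} - \mathbf{b}$
    for $m = 0, 1, 2, \ldots$
        $\mathbf{d}^{(m)} = -\mathbf{r}^{(m)}$
        $\lambda_m = -\dfrac{\mathbf{d}^{(m)T}\mathbf{r}^{(m)}}{\mathbf{d}^{(m)T} A \mathbf{d}^{(m)}}$
        $\mathbf{x}^{(m+1)} = \mathbf{x}^{(m)} + \lambda_m \mathbf{d}^{(m)}$
        $\mathbf{r}^{(m+1)} = \mathbf{r}^{(m)} + \lambda_m A \mathbf{d}^{(m)}$
        if $\sqrt{\mathbf{r}^{(m+1)T}\mathbf{r}^{(m+1)}} < TOL$, OUTPUT $\mathbf{x}^{(m+1)}$

Solve the linear systems in Exercises 5-10 using the method of steepest descent 
with $\mathbf{x}^{(0)} = \mathbf{0}$ and a convergence tolerance of $5 \times 10^{-7}$. Compare the performance 
of the method of steepest descent with that of the conjugate gradient method. 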
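13. The coefficients of the least squares cubic polynomial $a_0 + a_1 x + a_2 x^2 + a_3 x^3$ 
that fits the data 

    x    0.0   0.5   1.0   1.5   2.0   2.5   3.0   3.5
    y    1.0   1.7   2.1   2.0   1.1   0.9   1.4   3.1

satisfy the linear system of equations 

$$\begin{bmatrix}
8 & 14 & 35 & 98 \\
14 & 35 & 98 & 292.25 \\
35 & 98 & 292.25 & 906.5 \\
98 & 292.25 & 906.5 & 2887.8125
\end{bmatrix}
\begin{bmatrix} a_0 \\ a_1 \\ a_2 \\ a_3 \end{bmatrix}
=
\begin{bmatrix} 13.3 \\ 25.45 \\ 67.625 \\ 202.6375 \end{bmatrix}.$$

Determine the values of $a_0$, $a_1$, $a_2$, and $a_3$. 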


3.10 NONLINEAR SYSTEMS OF EQUATIONS 
Suppose we need to solve the system of three nonlinear equations 
$$\begin{aligned}
x_1^3 - 2x_2 - 2 &= 0 \\
x_1^3 - 5x_3^2 + 7 &= 0 \\
x_2 x_3^2 - 1 &= 0.
\end{aligned}$$

Although we cannot express this system in matrix notation because the equations 
are nonlinear, we can express the system in vector notation. First, define the 
functions 

$$\begin{aligned}
f_1(x_1, x_2, x_3) &= x_1^3 - 2x_2 - 2; \\
f_2(x_1, x_2, x_3) &= x_1^3 - 5x_3^2 + 7; \text{ and} \\
f_3(x_1, x_2, x_3) &= x_2 x_3^2 - 1.
\end{aligned}$$

Note that each of these functions represents the left-hand side of one of the equations 
from the nonlinear system. Next, let $\mathbf{x} = [\, x_1 \;\; x_2 \;\; x_3 \,]^T$ and construct the 
vector-valued function 

$$\mathbf{F}(\mathbf{x}) = \begin{bmatrix} f_1(x_1, x_2, x_3) \\ f_2(x_1, x_2, x_3) \\ f_3(x_1, x_2, x_3) \end{bmatrix}.$$

In terms of this vector-valued function, the original system of three nonlinear equations 
can be expressed concisely as the single vector equation $\mathbf{F}(\mathbf{x}) = \mathbf{0}$. 

The problem of finding a vector x for which the vector-valued function F 
evaluates to 0 (i.e., the zero vector) is a generalization of the rootfinding problem 
which was investigated in Chapter 2. It ought to be possible, then, to modify one 
of the techniques developed in that chapter to suit our present needs. We will focus 
our attention on Newton’s method and Broyden’s method, which is a modification 
of Newton’s method. 


Newton's Method for a System of Nonlinear Equations 


Recall that given a scalar function, $f$, of a single scalar argument and given an 
initial approximation, $x_0$, for a root of that function, Newton’s method computes a 
sequence of (hopefully) improved approximations to the root according to the rule 

$$x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}.$$

Now, let $\mathbf{F}$ be a vector-valued function of a vector argument $\mathbf{x}$, assuming that 
both vectors contain $m$ components. To apply Newton’s method to the problem of 
approximating a solution of $\mathbf{F}(\mathbf{x}) = \mathbf{0}$, we would like to write 

$$\mathbf{x}^{(n+1)} = \mathbf{x}^{(n)} - \frac{\mathbf{F}(\mathbf{x}^{(n)})}{\mathbf{F}'(\mathbf{x}^{(n)})}.$$


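This, however, brings up the immediate question of what is meant by $\mathbf{F}'(\mathbf{x}^{(n)})$. 
First, $\mathbf{F}'(\mathbf{x}^{(n)})$ must include the derivative of each scalar component function with 
respect to each component of the argument vector. That’s $m^2$ individual partial 
derivatives. These partial derivatives should be organized so that $d\mathbf{F} = \mathbf{F}'(\mathbf{x})\Delta\mathbf{x}$ 
provides an estimate for the change in $\mathbf{F}(\mathbf{x})$ when the argument changes from $\mathbf{x}$ to 
$\mathbf{x} + \Delta\mathbf{x}$. From multivariable calculus we know that 

$$df = \frac{\partial f}{\partial x_1}\Delta x_1 + \frac{\partial f}{\partial x_2}\Delta x_2 + \frac{\partial f}{\partial x_3}\Delta x_3 + \cdots + \frac{\partial f}{\partial x_m}\Delta x_m$$

for a scalar function of $m$ arguments, which suggests that the partial derivatives in 
$\mathbf{F}'(\mathbf{x})$ be organized into matrix form as follows: 

$$\mathbf{F}'(\mathbf{x}) = \begin{bmatrix}
\partial f_1/\partial x_1 & \partial f_1/\partial x_2 & \partial f_1/\partial x_3 & \cdots & \partial f_1/\partial x_m \\
\partial f_2/\partial x_1 & \partial f_2/\partial x_2 & \partial f_2/\partial x_3 & \cdots & \partial f_2/\partial x_m \\
\partial f_3/\partial x_1 & \partial f_3/\partial x_2 & \partial f_3/\partial x_3 & \cdots & \partial f_3/\partial x_m \\
\vdots & \vdots & \vdots & & \vdots \\
\partial f_m/\partial x_1 & \partial f_m/\partial x_2 & \partial f_m/\partial x_3 & \cdots & \partial f_m/\partial x_m
\end{bmatrix}.$$
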
This matrix is known as the Jacobian matrix for the system and is typically denoted 
by $J(\mathbf{x})$. Having established that $\mathbf{F}'(\mathbf{x})$ is a matrix, this brings up a second question: 
how do we divide by a matrix? Simple. We multiply by its inverse. Thus, Newton’s 
method for a system of equations takes the form 

$$\mathbf{x}^{(n+1)} = \mathbf{x}^{(n)} - \left[J(\mathbf{x}^{(n)})\right]^{-1}\mathbf{F}(\mathbf{x}^{(n)}).$$

When implementing this scheme, we will not actually compute the inverse of 
the Jacobian matrix. Instead, we define 

$$\mathbf{v}^{(n)} = -\left[J(\mathbf{x}^{(n)})\right]^{-1}\mathbf{F}(\mathbf{x}^{(n)}),$$

and then solve the linear system of equations 

$$J(\mathbf{x}^{(n)})\,\mathbf{v}^{(n)} = -\mathbf{F}(\mathbf{x}^{(n)})$$

for $\mathbf{v}^{(n)}$. Once $\mathbf{v}^{(n)}$ is known, the next iterate is computed according to the rule 

$$\mathbf{x}^{(n+1)} = \mathbf{x}^{(n)} + \mathbf{v}^{(n)}.$$


EXAMPLE 3.26 A System of Three Nonlinear Algebraic Equations 
Let’s apply Newton’s method to the system of three nonlinear algebraic equations 
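$$\begin{aligned}
x_1^3 - 2x_2 - 2 &= 0 \\
x_1^3 - 5x_3^2 + 7 &= 0 \\
x_2 x_3^2 - 1 &= 0.
\end{aligned}$$

Recall that this system is equivalent to the vector equation $\mathbf{F}(\mathbf{x}) = \mathbf{0}$, where 

$$\mathbf{F}(\mathbf{x}) = \begin{bmatrix} f_1(x_1, x_2, x_3) \\ f_2(x_1, x_2, x_3) \\ f_3(x_1, x_2, x_3) \end{bmatrix}
= \begin{bmatrix} x_1^3 - 2x_2 - 2 \\ x_1^3 - 5x_3^2 + 7 \\ x_2 x_3^2 - 1 \end{bmatrix}.$$

The Jacobian matrix associated with $\mathbf{F}(\mathbf{x})$ is easily found to be 

$$J(\mathbf{x}) = \begin{bmatrix} 3x_1^2 & -2 & 0 \\ 3x_1^2 & 0 & -10x_3 \\ 0 & x_3^2 & 2x_2x_3 \end{bmatrix}.$$

Starting from the initial vector $\mathbf{x}^{(0)} = [\, 1 \;\; 1 \;\; 1 \,]^T$, we compute 

$$\mathbf{F}(\mathbf{x}^{(0)}) = [\, -3 \;\; 3 \;\; 0 \,]^T$$

and 

$$J(\mathbf{x}^{(0)}) = \begin{bmatrix} 3 & -2 & 0 \\ 3 & 0 & -10 \\ 0 & 1 & 2 \end{bmatrix}.$$

Solving the linear system $J(\mathbf{x}^{(0)})\,\mathbf{v}^{(0)} = -\mathbf{F}(\mathbf{x}^{(0)})$ yields the update vector $\mathbf{v}^{(0)} = 
[\, 3/7 \;\; -6/7 \;\; 3/7 \,]^T$, and then $\mathbf{x}^{(1)} = \mathbf{x}^{(0)} + \mathbf{v}^{(0)} = [\, 10/7 \;\; 1/7 \;\; 10/7 \,]^T$. 
Continuing to iterate until the maximum norm of $\mathbf{v}^{(n)}$ is less than $5 \times 10^{-4}$, we obtain 
the results listed below. 

    n    x^(n)T
    0    1.00000000000000   1.00000000000000   1.00000000000000
    1    1.42857142857143   0.14285714285714   1.42857142857143
    2    1.44011117287382   0.49305169538633   1.41331295163980
    3    1.44225533875822   0.50000806218205   1.41421499021415
    4    1.44224957033522   0.50000000001480   1.41421356237591

The exact solution to this system, in the neighborhood of the initial vector $\mathbf{x}^{(0)} = 
[\, 1 \;\; 1 \;\; 1 \,]^T$, is $\mathbf{x} = [\, \sqrt[3]{3} \;\; 1/2 \;\; \sqrt{2} \,]^T$. Thus, four iterations of Newton’s 
method have produced results that are correct to eight decimal places. 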
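
The iteration carried out in Example 3.26 can be sketched in a few lines of Python. This is a minimal illustration rather than a production solver: the helper names `F`, `J`, `solve`, and `newton_system` are assumptions made for this sketch, and the linear system $J\mathbf{v} = -\mathbf{F}$ is solved with a small Gaussian elimination routine with partial pivoting.

```python
import math

def F(x):
    """Left-hand sides of the three equations from Example 3.26."""
    x1, x2, x3 = x
    return [x1**3 - 2*x2 - 2,
            x1**3 - 5*x3**2 + 7,
            x2*x3**2 - 1]

def J(x):
    """Jacobian matrix of F."""
    x1, x2, x3 = x
    return [[3*x1**2, -2.0, 0.0],
            [3*x1**2, 0.0, -10*x3],
            [0.0, x3**2, 2*x2*x3]]

def solve(A, b):
    """Solve A v = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))  # pivot row
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            factor = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= factor * M[c][k]
    v = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(M[r][k] * v[k] for k in range(r + 1, n))
        v[r] = (M[r][n] - s) / M[r][r]
    return v

def newton_system(F, J, x0, tol=5e-4, max_iter=50):
    """Newton's method; stops when the maximum norm of the update is below tol."""
    x = list(x0)
    for n in range(1, max_iter + 1):
        v = solve(J(x), [-fi for fi in F(x)])   # solve J(x) v = -F(x)
        x = [xi + vi for xi, vi in zip(x, v)]   # x <- x + v
        if max(abs(vi) for vi in v) < tol:
            return x, n
    raise RuntimeError("Newton's method did not converge")

x, iterations = newton_system(F, J, [1.0, 1.0, 1.0])
exact = [3 ** (1 / 3), 0.5, math.sqrt(2)]
print(iterations, x)
print(max(abs(a - b) for a, b in zip(x, exact)))
```

Note that the inverse of the Jacobian is never formed; each pass solves one linear system, exactly as described above.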


Quasi-Newton Methods 


Newton’s method for a system of nonlinear equations provides the same quadratic 
convergence that was observed for scalar equations in Chapter 2. Unfortunately, the 
cost associated with each iteration of Newton’s method for a system of equations is 
quite high. Not only does each iteration require the calculation of $\mathbf{F}(\mathbf{x}^{(k)})$, which 
is equivalent to $m$ scalar function evaluations, it also requires the evaluation of 
$J(\mathbf{x}^{(k)})$. This adds $m^2$ scalar function evaluations to the tally, one for each of the 
$m^2$ partial derivatives that make up the Jacobian. On top of all of these scalar 
function evaluations, an additional $O(m^3)$ algebraic operations are needed to solve 
the linear system 

$$J(\mathbf{x}^{(k)})\,\mathbf{v}^{(k)} = -\mathbf{F}(\mathbf{x}^{(k)})$$

for the update vector $\mathbf{v}^{(k)}$. 


To reduce the per iteration workload, a variety of modified Newton methods, 
also known as inexact Newton methods and quasi-Newton methods, have been 
developed. These methods involve such procedures as using the Jacobian from 
the first iteration for all subsequent iterations, updating the Jacobian matrix only 
every kth iteration and performing a violent diagonalization of the Jacobian matrix. 
This last procedure produces a scheme that is similar in nature to the Gauss-Seidel 
method (Section 3.8) for linear systems. See Dennis and Schnabel [1], Eisenstat and 
Walker [2], and Ortega and Rheinboldt [3] for general discussions of quasi-Newton 
methods. 

Below, we will develop one particularly popular quasi-Newton method, known 
as Broyden’s method ([4] and [5]). Broyden’s method is based on a clever procedure 
for constructing an approximation to the inverse of the Jacobian matrix as the 
iteration proceeds. The resulting scheme not only reduces the number of scalar 
function evaluations required each iteration, but also dramatically decreases the 
operation count associated with computing the update vector $\mathbf{v}^{(k)}$. On the 
downside, Broyden’s method is only superlinearly convergent, as opposed to quadratically 
convergent. For most practical problems, the decrease in order of convergence is an 
acceptable trade-off for the reduction in computational effort. There are procedures 
which reduce the workload per iteration and maintain quadratic convergence, such 
as Brown’s method ([6] and [7]) and Brent’s method [8]. Moré and Cosnard [9] 
present a comparative study of some of the commonly used methods of this type. 
These methods, however, are considerably more difficult to implement efficiently 
than is Broyden’s method. 


Broyden’s Method 


The first iteration of Broyden’s method is almost identical to that of Newton’s 
method. Using an initial vector $\mathbf{x}^{(0)}$, $\mathbf{F}(\mathbf{x}^{(0)})$ and $J(\mathbf{x}^{(0)})$ are evaluated. For later 
convenience, we will denote the Jacobian matrix $J(\mathbf{x}^{(0)})$ by $A_0$. Next, compute 
$A_0^{-1}$. Actually compute a matrix inverse? Yes! This will turn out to be one of 
those rare situations in which it is more efficient to compute the inverse than to 
solve the corresponding system of equations. The reason for this will be made clear 
below. Then multiply $-A_0^{-1}$ into $\mathbf{F}(\mathbf{x}^{(0)})$ to form $\mathbf{v}^{(0)}$. Finally, update $\mathbf{x}^{(0)}$ to $\mathbf{x}^{(1)}$. 
For all subsequent iterations, Broyden’s method forgoes the calculation of the 
Jacobian. Rather, a matrix $A_k$ is sought which approximates $J(\mathbf{x}^{(k)})$ in the sense 
that 

$$A_k(\mathbf{x}^{(k)} - \mathbf{x}^{(k-1)}) = \mathbf{F}(\mathbf{x}^{(k)}) - \mathbf{F}(\mathbf{x}^{(k-1)}). \qquad (1)$$

In one variable, the condition imposed in equation (1) is equivalent to 

$$f'(x_k) \approx \frac{f(x_k) - f(x_{k-1})}{x_k - x_{k-1}};$$

hence, Broyden’s method is a multivariable version of the secant method. Unfortunately, 
(1) does not uniquely determine the matrix $A_k$: it provides only $n$ 
equations for the $n^2$ elements of $A_k$. The best additional constraints (see Dennis 
and Schnabel [1]) are to require 

$$A_k\mathbf{u} = A_{k-1}\mathbf{u} \quad \text{for all vectors } \mathbf{u} \text{ such that } (\mathbf{x}^{(k)} - \mathbf{x}^{(k-1)})^T\mathbf{u} = 0.$$

Combining these conditions with (1) uniquely determines (Dennis and Moré [10]) 

$$A_k = A_{k-1} + \frac{(\mathbf{y} - A_{k-1}\Delta)\Delta^T}{\Delta^T\Delta}, \qquad (2)$$

where $\mathbf{y} = \mathbf{F}(\mathbf{x}^{(k)}) - \mathbf{F}(\mathbf{x}^{(k-1)})$ and $\Delta = \mathbf{x}^{(k)} - \mathbf{x}^{(k-1)}$. By using the matrix $A_k$ in 
place of $J(\mathbf{x}^{(k)})$, $n^2$ scalar function evaluations are saved each iteration. 

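A more significant reduction in computational effort is achieved by making use 
of the Sherman-Morrison formula [10]: If $M$ is a nonsingular matrix and $\mathbf{u}$ and $\mathbf{v}$ 
are vectors for which $\mathbf{v}^T M^{-1}\mathbf{u} \ne -1$, then the matrix $M + \mathbf{u}\mathbf{v}^T$ is nonsingular and 

$$(M + \mathbf{u}\mathbf{v}^T)^{-1} = M^{-1} - \frac{M^{-1}\mathbf{u}\mathbf{v}^T M^{-1}}{1 + \mathbf{v}^T M^{-1}\mathbf{u}}. \qquad (3)$$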
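
As a quick numerical sanity check of the Sherman-Morrison formula, the following Python sketch applies equation (3) to a small $2 \times 2$ case and multiplies the result by $M + \mathbf{u}\mathbf{v}^T$. The matrix and vectors are illustrative values chosen for this sketch, not data from the text, and the function names are assumptions.

```python
def matvec(M, x):
    """Matrix-vector product M x."""
    return [sum(m * xj for m, xj in zip(row, x)) for row in M]

def sherman_morrison(M_inv, u, v):
    """Return (M + u v^T)^{-1} given M^{-1}, using equation (3)."""
    n = len(u)
    Mu = matvec(M_inv, u)                                        # M^{-1} u
    vM = [sum(v[i] * M_inv[i][j] for i in range(n)) for j in range(n)]  # v^T M^{-1}
    denom = 1.0 + sum(vi * mi for vi, mi in zip(v, Mu))          # 1 + v^T M^{-1} u
    if denom == 0.0:
        raise ZeroDivisionError("v^T M^{-1} u = -1, so M + u v^T is singular")
    return [[M_inv[i][j] - Mu[i] * vM[j] / denom for j in range(n)]
            for i in range(n)]

# Illustrative data: M = diag(2, 4), so M^{-1} = diag(1/2, 1/4).
M_inv = [[0.5, 0.0], [0.0, 0.25]]
u, v = [1.0, 2.0], [3.0, 1.0]
B_inv = sherman_morrison(M_inv, u, v)

# M + u v^T = [[5, 1], [6, 6]]; the product below should be the identity.
B = [[5.0, 1.0], [6.0, 6.0]]
product = [[sum(B[i][k] * B_inv[k][j] for k in range(2)) for j in range(2)]
           for i in range(2)]
print(product)
```

The update costs only a few matrix-vector products and one outer product, which is the source of the savings exploited below.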




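Identifying the matrix $A_{k-1}$ with $M$, the vector $(\mathbf{y} - A_{k-1}\Delta)/(\Delta^T\Delta)$ with $\mathbf{u}$ and 
the vector $\Delta$ with $\mathbf{v}$, equations (2) and (3) imply 

$$\begin{aligned}
A_k^{-1} &= \left(A_{k-1} + \frac{(\mathbf{y} - A_{k-1}\Delta)\Delta^T}{\Delta^T\Delta}\right)^{-1} \\
&= A_{k-1}^{-1} - \frac{A_{k-1}^{-1}\left(\dfrac{\mathbf{y} - A_{k-1}\Delta}{\Delta^T\Delta}\right)\Delta^T A_{k-1}^{-1}}{1 + \Delta^T A_{k-1}^{-1}\left(\dfrac{\mathbf{y} - A_{k-1}\Delta}{\Delta^T\Delta}\right)} \\
&= A_{k-1}^{-1} + \frac{(\Delta - A_{k-1}^{-1}\mathbf{y})\Delta^T A_{k-1}^{-1}}{\Delta^T A_{k-1}^{-1}\mathbf{y}}.
\end{aligned} \qquad (4)$$

Note that equation (4) uses only matrix-vector multiplication and hence provides 
a procedure for calculating $A_k^{-1}$ directly from $A_{k-1}^{-1}$, in only $O(n^2)$ algebraic 
operations. With $A_k^{-1}$ known, the need to solve a linear system at each iteration 
to compute the update vector $\mathbf{v}^{(k)}$ is eliminated; instead, $\mathbf{v}^{(k)}$ is determined by 
forming the matrix-vector product $-A_k^{-1}\mathbf{F}(\mathbf{x}^{(k)})$. Thus, each iteration of Broyden’s 
method will require only $O(n^2)$ algebraic operations, as compared to the $O(n^3)$ 
algebraic operations required by Newton’s method. 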


EXAMPLE 3.27 A System of Three Nonlinear Algebraic Equations 
Revisited 


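Let’s apply Broyden’s method to the same system of nonlinear algebraic equations 
that we investigated earlier, $\mathbf{F}(\mathbf{x}) = \mathbf{0}$, where 

$$\mathbf{F}(\mathbf{x}) = \begin{bmatrix} f_1(x_1, x_2, x_3) \\ f_2(x_1, x_2, x_3) \\ f_3(x_1, x_2, x_3) \end{bmatrix}
= \begin{bmatrix} x_1^3 - 2x_2 - 2 \\ x_1^3 - 5x_3^2 + 7 \\ x_2 x_3^2 - 1 \end{bmatrix}.$$

Recall that the Jacobian matrix associated with $\mathbf{F}(\mathbf{x})$ is given by 

$$J(\mathbf{x}) = \begin{bmatrix} 3x_1^2 & -2 & 0 \\ 3x_1^2 & 0 & -10x_3 \\ 0 & x_3^2 & 2x_2x_3 \end{bmatrix}.$$

Starting from the initial vector $\mathbf{x}^{(0)} = [\, 1 \;\; 1 \;\; 1 \,]^T$, we compute 

$$\mathbf{F}(\mathbf{x}^{(0)}) = [\, -3 \;\; 3 \;\; 0 \,]^T$$

and 

$$A_0 = J(\mathbf{x}^{(0)}) = \begin{bmatrix} 3 & -2 & 0 \\ 3 & 0 & -10 \\ 0 & 1 & 2 \end{bmatrix}.$$

It follows that 

$$A_0^{-1} = \frac{1}{42}\begin{bmatrix} 10 & 4 & 20 \\ -6 & 6 & 30 \\ 3 & -3 & 6 \end{bmatrix},$$

$$\mathbf{v}^{(0)} = -A_0^{-1}\mathbf{F}(\mathbf{x}^{(0)}) = -\frac{1}{42}\begin{bmatrix} 10 & 4 & 20 \\ -6 & 6 & 30 \\ 3 & -3 & 6 \end{bmatrix}\begin{bmatrix} -3 \\ 3 \\ 0 \end{bmatrix} = \begin{bmatrix} 3/7 \\ -6/7 \\ 3/7 \end{bmatrix},$$

and $\mathbf{x}^{(1)} = \mathbf{x}^{(0)} + \mathbf{v}^{(0)} = [\, 10/7 \;\; 1/7 \;\; 10/7 \,]^T$. 
For the next iteration, we start by computing 

$$\mathbf{F}(\mathbf{x}^{(1)}) = \begin{bmatrix} 0.62973760932945 \\ -0.28862973760933 \\ -0.70845481049563 \end{bmatrix}$$

and 

$$\mathbf{y} = \mathbf{F}(\mathbf{x}^{(1)}) - \mathbf{F}(\mathbf{x}^{(0)}) = \begin{bmatrix} 3.62973760932945 \\ -3.28862973760933 \\ -0.70845481049563 \end{bmatrix}$$

and also noting that $\Delta = \mathbf{v}^{(0)}$. To compute $A_1^{-1}$ according to equation (4), we will 
need the intermediate results 

$$A_0^{-1}\mathbf{y} = \begin{bmatrix} 0.21366097459392 \\ -1.49437734277384 \\ 0.39296126613911 \end{bmatrix},$$

$$\Delta^T A_0^{-1} = [\, 0.25510204081633 \;\; -0.11224489795918 \;\; -0.34693877551020 \,]$$

and $\Delta^T A_0^{-1}\mathbf{y} = 1.54087582554888$. It then follows that 

$$A_1^{-1} = \begin{bmatrix}
0.27367506516073 & 0.07958297132928 & 0.42780191138141 \\
-0.03735881841877 & 0.09643788010426 & 0.57080799304952 \\
0.07732406602954 & -0.07402258905300 & 0.13483927019983
\end{bmatrix},$$

$$\mathbf{v}^{(1)} = -A_1^{-1}\mathbf{F}(\mathbf{x}^{(1)}) = \begin{bmatrix} 0.15370485292292 \\ 0.45575276157379 \\ 0.02546853667618 \end{bmatrix}$$

and 

$$\mathbf{x}^{(2)} = \mathbf{x}^{(1)} + \mathbf{v}^{(1)} = \begin{bmatrix} 1.58227628149435 \\ 0.59860990443093 \\ 1.45403996524761 \end{bmatrix}.$$

Continuing to iterate until the maximum norm of $\mathbf{v}^{(n)}$ is less than $5 \times 10^{-4}$, 
we obtain the results listed below. As expected, Broyden’s method requires more 
iterations than does Newton’s method, but each iteration is cheaper in terms of 
computational effort. The $l_\infty$-norm of the difference between $\mathbf{x}^{(7)}$ and the exact 
solution $\mathbf{x} = [\, \sqrt[3]{3} \;\; 1/2 \;\; \sqrt{2} \,]^T$ is $1.038 \times 10^{-5}$. 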

    n    x^(n)T
    0    1.00000000000000   1.00000000000000   1.00000000000000
    1    1.42857142857143   0.14285714285714   1.42857142857143
    2    1.58227628149435   0.59860990443093   1.45403996524761
    3    1.35508942048637   0.49542848694730   1.41165011868362
    4    1.44643226689446   0.50532150828807   1.40994166582347
    5    1.44288531438848   0.49855503157583   1.41541277217107
    6    1.44207572772478   0.50011152020649   1.41412254801611
    7    1.44225995004753   0.49999540342064   1.41421807651088
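
The computations of Example 3.27 can be reproduced with a short Python sketch. It is a minimal illustration under stated assumptions: the names `broyden`, `matvec`, and `history` are invented for this sketch, the initial inverse $A_0^{-1}$ is entered directly from the example rather than computed, and no safeguard is included against a vanishing denominator $\Delta^T A_{k-1}^{-1}\mathbf{y}$ in equation (4).

```python
import math

def F(x):
    """Left-hand sides of the system from Example 3.27."""
    x1, x2, x3 = x
    return [x1**3 - 2*x2 - 2, x1**3 - 5*x3**2 + 7, x2*x3**2 - 1]

def matvec(M, x):
    """Matrix-vector product M x."""
    return [sum(m * xj for m, xj in zip(row, x)) for row in M]

def broyden(F, A_inv, x0, tol=5e-4, max_iter=100):
    """Broyden's method; A_inv is the inverse of the Jacobian at x0."""
    n = len(x0)
    x = list(x0)
    Fx = F(x)
    history = [x]
    for k in range(1, max_iter + 1):
        v = [-s for s in matvec(A_inv, Fx)]          # v = -A^{-1} F(x)
        x_new = [xi + vi for xi, vi in zip(x, v)]
        history.append(x_new)
        if max(abs(vi) for vi in v) < tol:
            return x_new, k, history
        F_new = F(x_new)
        y = [a - b for a, b in zip(F_new, Fx)]       # y = F(x_k) - F(x_{k-1})
        delta = v                                    # Delta = x_k - x_{k-1}
        Ay = matvec(A_inv, y)                        # A^{-1} y
        dA = [sum(delta[i] * A_inv[i][j] for i in range(n))
              for j in range(n)]                     # Delta^T A^{-1}
        denom = sum(d * a for d, a in zip(delta, Ay))  # Delta^T A^{-1} y
        # Rank-one update of the inverse, equation (4)
        A_inv = [[A_inv[i][j] + (delta[i] - Ay[i]) * dA[j] / denom
                  for j in range(n)] for i in range(n)]
        x, Fx = x_new, F_new
    raise RuntimeError("Broyden's method did not converge")

# A_0^{-1} = (1/42) [[10, 4, 20], [-6, 6, 30], [3, -3, 6]] from the example
A0_inv = [[10/42, 4/42, 20/42], [-6/42, 6/42, 30/42], [3/42, -3/42, 6/42]]
x, iterations, history = broyden(F, A0_inv, [1.0, 1.0, 1.0])
exact = [3 ** (1 / 3), 0.5, math.sqrt(2)]
print(iterations, x)
print(max(abs(a - b) for a, b in zip(x, exact)))
```

Each pass through the loop uses only matrix-vector products and one outer product, so no linear system is solved after the initial inverse is supplied.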


Application Problem 1: Coupled Reversible Chemical Reactions 


In the Chapter 3 Overview (see page 141), we developed the system of equations 

$$k_1 = \frac{c_1 + c_2}{(A_0 - 2c_1 - c_2)^2(B_0 - c_1)} \quad \text{and} \quad k_2 = \frac{c_1 + c_2}{(A_0 - 2c_1 - c_2)(D_0 - c_2)}$$

for the number of moles, $c_1$ and $c_2$, of a chemical $C$ produced at equilibrium by the 
coupled chemical reactions 

$$2A + B \rightleftharpoons C$$

$$A + D \rightleftharpoons C.$$

The subscript 1 refers to the first of these reactions, and the subscript 2 refers to 
the second. The parameters $A_0$, $B_0$, and $D_0$ are the initial number of moles of 
the chemicals $A$, $B$, and $D$, respectively, injected into the reaction chamber. The 
equilibrium constants for the two reactions are denoted by $k_1$ and $k_2$. 

Suppose $A_0 = 20$ moles, $B_0 = D_0 = 10$ moles, $k_1 = 1.63 \times 10^{-4}$, and 
$k_2 = 3.27 \times 10^{-3}$. Substituting these values into the equations for $c_1$ and $c_2$ and 
rearranging the terms yields 

$$c_1 + c_2 - 1.63 \times 10^{-4}(20 - 2c_1 - c_2)^2(10 - c_1) = 0$$

$$c_1 + c_2 - 3.27 \times 10^{-3}(20 - 2c_1 - c_2)(10 - c_2) = 0.$$
Applying Newton’s method to this system of nonlinear equations, with an initial 
guess of $\mathbf{x}^{(0)} = [\, 0.5 \;\; 0.5 \,]^T$ and a convergence tolerance of $5 \times 10^{-6}$, the following 
values were obtained after four iterations: 

$$c_1 = 0.10987 \quad \text{and} \quad c_2 = 0.49001.$$

Therefore, there are $0.10987 + 0.49001 = 0.59988$ moles of $C$ present at equilibrium. 
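
A minimal Python sketch of this calculation is given below. The Jacobian entries were obtained by differentiating the two rearranged equations by hand, and the $2 \times 2$ linear system is solved with Cramer's rule; the function names and the default tolerance of `5e-6` are assumptions made for the sketch rather than details taken from the text.

```python
K1, K2 = 1.63e-4, 3.27e-3        # equilibrium constants
A0, B0, D0 = 20.0, 10.0, 10.0    # initial moles of A, B, and D

def F(c):
    """Residuals of the two rearranged equilibrium equations."""
    c1, c2 = c
    g = A0 - 2*c1 - c2
    return [c1 + c2 - K1 * g**2 * (B0 - c1),
            c1 + c2 - K2 * g * (D0 - c2)]

def J(c):
    """Jacobian of F, differentiated by hand."""
    c1, c2 = c
    g = A0 - 2*c1 - c2
    return [[1 + K1 * (4*g*(B0 - c1) + g**2), 1 + 2*K1*g*(B0 - c1)],
            [1 + 2*K2*(D0 - c2),              1 + K2*((D0 - c2) + g)]]

def newton_2x2(F, J, c0, tol=5e-6, max_iter=50):
    """Newton's method for two equations; J v = -F is solved by Cramer's rule."""
    c = list(c0)
    for n in range(1, max_iter + 1):
        (a, b), (p, q) = J(c)
        f1, f2 = F(c)
        det = a*q - b*p
        v1 = (-f1*q + b*f2) / det
        v2 = (-a*f2 + p*f1) / det
        c = [c[0] + v1, c[1] + v2]
        if max(abs(v1), abs(v2)) < tol:
            return c, n
    raise RuntimeError("Newton's method did not converge")

c, iterations = newton_2x2(F, J, [0.5, 0.5])
print(iterations, c, sum(c))
```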


Application Problem 2: Flow Distribution in a Pipe Flow Network 


The following diagram shows a pipe network through which water at 20°C is flowing. 
Given that the pump produces an outlet pressure of $5.2 \times 10^5$ Pa, we would like to 
determine the volumetric flow rates (measured in m$^3$/s) through each pipe in the 
network. These quantities are labeled $q_1$, $q_2$, $q_3$, $q_4$, $q_5$, $q_6$, and $q_7$ in the diagram. 
The length of each pipe is also listed. 

[Figure: pipe network diagram showing the pump (outlet pressure $5.2 \times 10^5$ Pa), the flow rates $q_1$ through $q_7$, and the length of each pipe.] 

The analysis of a pipe network such as this is similar to the analysis of an 
electric circuit. We focus on junctions and loops. At each junction, the rate at which 
fluid enters the junction must equal the rate at which fluid leaves the junction. 
Starting with the junction at the upper left and proceeding clockwise about the 
network, we obtain the equations 


$$\begin{aligned}
q_1 - q_2 - q_6 &= 0; \\
q_2 - q_3 - q_4 &= 0; \\
q_3 + q_4 - q_5 &= 0; \text{ and} \\
q_5 + q_6 - q_7 &= 0.
\end{aligned}$$


Around any loop in the network, the sum of the pressure drops around the 
loop must equal zero. The pressure drop along each pipe is due to friction and is 
given by the Darcy-Weisbach equation (see White [11]) 

$$\text{pressure drop} = \frac{8 f \rho L q^2}{\pi^2 d^5}.$$

Here, $f$ is the Darcy friction factor, $\rho$ is the density of the fluid, $L$ is the length 
of the pipe, $q$ is the volumetric flow rate, and $d$ is the inside diameter of the pipe. 
Suppose that all of the pipes in this network have a friction factor $f = 0.02$ and an 
inside diameter $d = 0.2$ m. At 20°C, water has a density of 998 kg/m$^3$. Traveling 
clockwise around the rightmost, middle, and leftmost loops in the network and 
dividing the resulting equations by $8f\rho/(\pi^2 d^5)$ yields 

$$\begin{aligned}
200q_3^2 - 75q_4^2 &= 0; \\
100q_2^2 + 75q_4^2 + 100q_5^2 - 75q_6^2 &= 0; \text{ and} \\
100q_1^2 + 75q_6^2 + 50q_7^2 - 5.2 \times 10^5\,\frac{\pi^2(0.2)^5}{8(0.02)(998)} &= 0.
\end{aligned}$$


Combining the junction equations with the loop equations produces a set of 
seven nonlinear equations in the seven unknown flow rates. Broyden’s method was 
applied to this system, with an initial guess of $q_i = 0.1$ for each $i$. Iterations were 
terminated when the maximum norm of the difference between successive iterates 
fell below $5 \times 10^{-6}$. A total of nine iterations were required to compute 

$$\mathbf{q} = \begin{bmatrix} 0.2388 \\ 0.0869 \\ 0.0330 \\ 0.0539 \\ 0.0869 \\ 0.1519 \\ 0.2388 \end{bmatrix}.$$
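
To make the structure of this seven-equation system concrete, the following Python sketch assembles the four junction equations and three loop equations into a single residual function and evaluates it at the reported flow rates. It does not perform the Broyden iteration itself, and the names `residuals` and `P_PUMP` are assumptions made for the sketch. Because the reported values are rounded to four decimal places, the residuals are small but not exactly zero.

```python
import math

F_FRICTION = 0.02    # Darcy friction factor
RHO = 998.0          # density of water at 20 C, kg/m^3
D = 0.2              # inside pipe diameter, m
P_PUMP = 5.2e5       # pump outlet pressure, Pa

def residuals(q):
    """Junction and loop equations for the pipe network, in the order given in the text."""
    q1, q2, q3, q4, q5, q6, q7 = q
    # Pump pressure divided by 8 f rho / (pi^2 d^5), as in the loop equations
    pump = P_PUMP * math.pi**2 * D**5 / (8 * F_FRICTION * RHO)
    return [
        q1 - q2 - q6,                                   # junction equations
        q2 - q3 - q4,
        q3 + q4 - q5,
        q5 + q6 - q7,
        200*q3**2 - 75*q4**2,                           # loop equations
        100*q2**2 + 75*q4**2 + 100*q5**2 - 75*q6**2,
        100*q1**2 + 75*q6**2 + 50*q7**2 - pump,
    ]

q = [0.2388, 0.0869, 0.0330, 0.0539, 0.0869, 0.1519, 0.2388]
print(max(abs(r) for r in residuals(q)))
```

A function of this form is what a Newton or Broyden routine would call at every iteration.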
EXERCISES 


1. For each of the following nonlinear systems, write out the vector-valued function 
F associated with the system and compute the Jacobian of F. 


() 1a = 0 (b) lta2-e" = 0 
zy +22 - 24 = 0 zji-t2 = 0 
(ec) 221 -—322+2%3-4 = 0 
22, +2%2-23+4 = 0 


ve tag t+a3—4 = 0 
2. For each of the nonlinear systems in Exercise 1, carry out two iterations of 
Newton’s method. Use the initial vector indicated below. 
(a) x=[9 3] 
(b) $\mathbf{x}^{(0)} = [\, 1 \;\; 1 \,]^T$ 
(c) x@=s[-1 -3 3 ihe 


3. For each of the nonlinear systems in Exercise 1, carry out two iterations of 
Broyden’s method. Use the initial vectors indicated in Exercise 2. 


In Exercises 4-10, solve the indicated nonlinear system of equations using both Newton’s 
method and Broyden’s method. Use the indicated initial vector, and terminate the 
iteration process when the maximum norm of the difference between successive iterates 
falls below 5 x 107°. Compare the number of iterations required by the two methods 
to achieve convergence. 


4, 5cosrtbcos(r+y)—10 = 0 x =[07 07 ]7 
5sine+6sinfe+y)—4 = 0 
5. o4+a24+23-1 = 0 xO=[1 1 1]7 
xe +2%-0.25 = 0 
x? + a2 — dog = 0 
6. ce +23+e3-10 = 0 x) =[2 0 2]7 
2j+2r2-2 = 0 
21 +3273-9 = 0 
7 o4102e-y—-5 = 0 see ahs 
a+y>—-lWyt+1 = 0 
8. t+ 5001 +23 +23-200 = 0 gO a2 * 
a? + 20t2+23-50 = 0 
2? — 234+ 4023+75 = 0 
9. 2e—cosy = O x = [0 0" 
2y-snz = 0 
10. a22-dx; = 0 x%=[7 2]7 
xt 1 us 
Tray 322 = 


11. The following systems have the indicated number of solutions. Approximate each 
of the solutions to within a convergence tolerance of 5 x 107°. 


(a) 
— — 
ae 7 ie 4 oe ; 2 solutions 
(b) 
4%, -29+2%3-21%4 = O 
—21 + 329 — 223 -20%4 = O ; 
21 — 240+ 3823-24324 = O 4 solutions 
ot 2 +22-1 = 0 
(c) . 
2 = = 
a Pa ep 2 solutions 
12. 


13, 


14. 


15. 


16. 
The filter coefficients $h_1$, $h_2$, $h_3$, and $h_4$ for the Daubechies wavelet of length 
4 are solutions of the system 

$$\begin{aligned}
h_1 + h_2 + h_3 + h_4 &= \sqrt{2} \\
h_1 - h_2 + h_3 - h_4 &= 0 \\
3h_1 - 2h_2 + h_3 &= 0 \\
h_1^2 + h_2^2 + h_3^2 + h_4^2 &= 1.
\end{aligned}$$

Determine $h_1$, $h_2$, $h_3$, and $h_4$. 


(a) Repeat the “Coupled Reversible Chemical Reactions” application problem 
from the text using Broyden’s method. Use the same initial vector and 
convergence tolerance as were used in the example. 

(b) Repeat the “Flow Distribution in a Pipe Flow Network” application problem 
from the text using Newton’s method. Use the same initial vector and 
convergence tolerance as were used in the example. 


Repeat the “Coupled Reversible Chemical Reactions” application problem, changing 
the parameter values to $A_0 = 5$ moles, $B_0 = 2$ moles, $D_0 = 1$ mole, 
$k_1 = 4.25 \times 10^{-2}$, and $k_2 = 0.286$. 


Suppose 3 moles of chemical A, 2 moles of chemical B, and 1 mole of chemical D 
are injected into a one-liter reaction chamber and the coupled chemical reactions 


A+2B=2C 
2A4+D=C 


are allowed to proceed to equilibrium. The equilibrium constants for the reactions 
are ky = 1.00 x 107? and ke = 5.12 x 107%. How many moles of C’ are present 
at equilibrium? 

The diagram given below shows a pipe network through which water at 20°C is 
flowing. Given that the pump produces an outlet pressure of $4.1 \times 10^5$ Pa and 
that all of the pipes have a friction factor of $f = 0.00225$ and an inside diameter 
of $d = 0.15$ m, determine the volumetric flow rates (measured in m$^3$/s) through 
each pipe in the network. 

[Figure: pipe network diagram showing the pump (outlet pressure $4.1 \times 10^5$ Pa), the flow rates $q_i$, and the length of each pipe.] 


CHAPTER 4 


Eigenvalues and Eigenvectors 


AN OVERVIEW 
Fundamental Mathematical Problem 


In this chapter, we will develop a variety of techniques for approximating the eigenvalues 
and eigenvectors of an $n \times n$ matrix. Recall that an eigenvalue of a matrix $A$ 
is any number, typically denoted by $\lambda$, for which the equation $A\mathbf{v} = \lambda\mathbf{v}$ has a 
nonzero solution for the vector $\mathbf{v}$. Since the equation $A\mathbf{v} = \lambda\mathbf{v}$ is equivalent to 
$(A - \lambda I)\mathbf{v} = \mathbf{0}$, we see that the eigenvalues of $A$ are those values of $\lambda$ for which the 
matrix $A - \lambda I$ is singular; that is, those values of $\lambda$ for which $\det(A - \lambda I) = 0$. 

As a function of $\lambda$, $\det(A - \lambda I)$ is an $n$th-degree polynomial, known as the 
characteristic polynomial of $A$. Counting multiplicities, an $n \times n$ matrix therefore 
has precisely $n$ eigenvalues. Furthermore, the coefficients of the characteristic polynomial 
are sums and products of the elements in $A$. If $A$ is a real matrix, it then 
follows that the eigenvalues must be real or occur in complex conjugate pairs. The 
collection of all eigenvalues of a matrix is known as the spectrum of the matrix. 

The nonzero vector $\mathbf{v}$ for which $A\mathbf{v} = \lambda\mathbf{v}$ is called an eigenvector of the 
matrix $A$ associated with the eigenvalue $\lambda$. Since $\mathbf{v}$ is a solution to the matrix 
equation $(A - \lambda I)\mathbf{v} = \mathbf{0}$ when $A - \lambda I$ is singular, eigenvectors are not unique. They 
are, however, determined up to a multiplicative constant. In other words, if $\mathbf{v}$ is 
an eigenvector associated with the eigenvalue $\lambda$, then $\alpha\mathbf{v}$ is also an eigenvector 
associated with the same eigenvalue, for any nonzero constant $\alpha$. 

The “Steady-State Distribution of the British Workforce” problem capsule 
from the Chapter 1 Overview (see page 4) is one example that gives rise to an 
eigenvalue problem. Here is another example. 


Measuring the Student Experience 


Table 4.1 summarizes the correlations among seven different measures of the “student 
experience” for the four-year colleges and universities in the Commonwealth 
of Virginia. The measures include the percentage of first-year students who return 
for their second year (RET), the percentage of classes with fewer than 20 
students (< 20), the percentage of classes with more than 50 students (> 50), the 
percentage of classes taught by full-time faculty (FTFAC), the average number of 
years needed to graduate in the current graduating class (GTIME), the percentage 
of first-time full-time students who graduate within six years (GRATE), and 
the donation rate for alumni (ALUM). Suppose we want to use the information in 
Table 4.1 to construct two or three composite measures, or indices, of the student 
experience, thereby making it easier to compare different institutions. 

            RET       < 20      > 50      FTFAC     GTIME     GRATE     ALUM
    RET     1.0000   -0.2411    0.4931    0.3009   -0.6865    0.9493    0.7538
    < 20   -0.2411    1.0000   -0.5535   -0.0387    0.1256   -0.1698    0.0684
    > 50    0.4931   -0.5535    1.0000   -0.2095   -0.1546    0.3972   -0.0643
    FTFAC   0.3009   -0.0387   -0.2095    1.0000   -0.2357    0.3994    0.4033
    GTIME  -0.6865    0.1256   -0.1546   -0.2357    1.0000   -0.7761   -0.7330
    GRATE   0.9493   -0.1698    0.3972    0.3994   -0.7761    1.0000    0.7601
    ALUM    0.7538    0.0684   -0.0643    0.4033   -0.7330    0.7601    1.0000

TABLE 4.1: Correlations Among Measures of the Student Experience at Four-Year Colleges and 
Universities in Virginia 


Let $R$ denote the $7 \times 7$ matrix of correlation values given in Table 4.1. The 
eigenvectors of $R$, scaled to unit length in the $l_2$-norm, are called the principal 
components of the original set of variables. Note that each principal component 
represents a specific linear combination of the variables. Moreover, the principal 
components are uncorrelated. Consequently, the principal components are ideal 
candidates for the composite measures we seek. 

But which principal components should we use? The variation in the original 
data is divided among the principal components. In particular, the percentage of 
variation accounted for by each principal component is given by the ratio of the 
associated eigenvalue to the number of variables. It is therefore standard procedure 
to rank the principal components according to the size of the eigenvalues of R. 
The largest eigenvalue is associated with the first principal component, the next 
largest eigenvalue with the second principal component, and so on. In order to 
capture as much of the variation in the original data as possible, we should then 
choose the first few principal components as our composite measures of the student 
experience. 


Remainder of the Chapter 


Section 1 focuses on the power method, which is used to determine what is known as 
the dominant eigenvalue and its associated eigenvector. The inverse power method, 
which can be used to approximate the smallest eigenvalue of a matrix or to approxi- 
mate the eigenvalue nearest to a given value, is described in Section 2. This method 
also produces an estimate for the associated eigenvector. The topic of deflation, 
which involves transforming a matrix so as to “remove” a previously determined 
eigenvalue from the spectrum, is discussed in Section 3. In the final two sections of 
the chapter, techniques for simultaneously approximating all of the eigenvalues of a 
symmetric matrix are presented. We first consider the reduction of symmetric ma- 
trices to tridiagonal form (Section 4) and finally discuss determining the eigenvalues 
for symmetric tridiagonal matrices (Section 5). 
Localizing Eigenvalues 


Before launching into the development of numerical techniques for approximating 
eigenvalues, let’s consider an important analytical result. For obvious reasons, this 
theorem is known as the Gerschgorin Circle Theorem. 


Theorem. Let $A$ be an $n \times n$ matrix and define $r_i = \sum_{j=1,\, j \ne i}^{n} |a_{ij}|$ for each 
$i = 1, 2, 3, \ldots, n$. Further, let 

$$C_i = \{z \in \mathbb{C} : |z - a_{ii}| \le r_i\},$$

where $\mathbb{C}$ denotes the complex plane. 

1. If $\lambda$ is an eigenvalue of $A$, then $\lambda$ lies in one of the circles $C_i$. 

2. If $k$ of the circles $C_i$ form a connected region $R$ in the complex plane, 
disjoint from the remaining $n - k$ circles, then the region $R$ contains exactly 
$k$ eigenvalues. 


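Proof. Let $\lambda$ be an eigenvalue of $A$ with associated eigenvector $\mathbf{x}$. Define 
$r_i = \sum_{j=1,\, j \ne i}^{n} |a_{ij}|$ for each $i = 1, 2, 3, \ldots, n$. Further, let $k$ be an index for 
which $|x_k| = \|\mathbf{x}\|_\infty$. Equating the $k$th elements in the eigenvalue relation 
$A\mathbf{x} = \lambda\mathbf{x}$ yields 

$$\sum_{j=1}^{n} a_{kj}x_j = \lambda x_k,$$

or 

$$(\lambda - a_{kk})x_k = \sum_{j=1}^{k-1} a_{kj}x_j + \sum_{j=k+1}^{n} a_{kj}x_j.$$

Hence, upon taking the absolute value and repeatedly applying the triangle 
inequality, 

$$\begin{aligned}
|\lambda - a_{kk}|\,|x_k| &\le \sum_{j=1}^{k-1} |a_{kj}x_j| + \sum_{j=k+1}^{n} |a_{kj}x_j| \\
&\le \|\mathbf{x}\|_\infty \sum_{j=1}^{k-1} |a_{kj}| + \|\mathbf{x}\|_\infty \sum_{j=k+1}^{n} |a_{kj}| \\
&= r_k\|\mathbf{x}\|_\infty.
\end{aligned}$$

Dividing by $|x_k| = \|\mathbf{x}\|_\infty \ne 0$, it follows that $|\lambda - a_{kk}| \le r_k$, so $\lambda \in C_k$ and the first part of the 
theorem is proven. Ortega [1] contains a very readable proof of the second 
conclusion. $\square$ 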
EXAMPLE 4.1 Localizing Eigenvalues using the Gerschgorin Circle 
Theorem 

Consider the matrix 

$$A = \begin{bmatrix} 1 & -1 & 0 \\ 1 & 5 & 1 \\ -2 & -1 & 9 \end{bmatrix}.$$

Proceeding row by row, we find that the radii of the Gerschgorin circles for this 
matrix are 

$$\begin{aligned}
r_1 &= |-1| + |0| = 1 \\
r_2 &= |1| + |1| = 2 \\
r_3 &= |-2| + |-1| = 3.
\end{aligned}$$

The circles are therefore given by 

$$\begin{aligned}
C_1 &= \{z \in \mathbb{C} : |z - 1| \le 1\} \\
C_2 &= \{z \in \mathbb{C} : |z - 5| \le 2\} \\
C_3 &= \{z \in \mathbb{C} : |z - 9| \le 3\}
\end{aligned}$$

and are plotted in Figure 4.1. Note that $C_1$ is disjoint from the other circles, which 
implies that one of the eigenvalues must be contained in $C_1$. Furthermore, since $A$ 
is a real matrix, the eigenvalue in $C_1$ must be real and therefore must lie on the 
closed interval $[0, 2]$. On the other hand, circles $C_2$ and $C_3$ overlap. Their union 
must therefore contain the two other eigenvalues of $A$. These final two eigenvalues 
could be two real eigenvalues or a complex conjugate pair. 
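
The row-by-row calculation in Example 4.1 is easy to automate. The following Python sketch computes the center and radius of each Gerschgorin circle and tests pairs of circles for overlap; the function names `gerschgorin_circles` and `disjoint` are assumptions made for this illustration.

```python
def gerschgorin_circles(A):
    """Return a (center, radius) pair for each row of the square matrix A."""
    circles = []
    for i, row in enumerate(A):
        radius = sum(abs(a) for j, a in enumerate(row) if j != i)
        circles.append((row[i], radius))
    return circles

def disjoint(c1, c2):
    """True when two closed circles (center, radius) do not intersect."""
    return abs(c1[0] - c2[0]) > c1[1] + c2[1]

A = [[1, -1, 0],
     [1, 5, 1],
     [-2, -1, 9]]

circles = gerschgorin_circles(A)
print(circles)                            # [(1, 1), (5, 2), (9, 3)]
print(disjoint(circles[0], circles[1]))   # True: C1 is isolated from C2
print(disjoint(circles[0], circles[2]))   # True: C1 is isolated from C3
print(disjoint(circles[1], circles[2]))   # False: C2 and C3 overlap
```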


References 


1. J. Ortega, Numerical Analysis—A Second Course, Academic Press, New York, 
1972. 


4.1 THE POWER METHOD 


Some matrix eigenvalue problems require the computation of a single eigenvalue, 
others the computation of several eigenvalues and yet others the computation of all 
of the eigenvalues. The corresponding eigenvectors may or may not also be needed. 
To handle each of these situations efficiently, we will need to develop several different 
solution strategies. In this section we will introduce the power method, an iterative 
technique for locating what is known as the dominant eigenvalue of a matrix. The 
power method also computes an associated eigenvector. In subsequent sections we 
will introduce extensions to the power method which allow for the computation of 
other eigenvalues. 


Derivation of Method 


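Let $A$ be an $n \times n$ matrix with eigenvalues $\lambda_1, \lambda_2, \lambda_3, \ldots, \lambda_n$, not necessarily distinct, 
that satisfy the relations $|\lambda_1| > |\lambda_2| \ge |\lambda_3| \ge \cdots \ge |\lambda_n|$. The eigenvalue $\lambda_1$, 
which is largest in magnitude, is known as the dominant eigenvalue of the matrix 
$A$. Furthermore, assume that the associated eigenvectors $\mathbf{v}_1, \mathbf{v}_2, \mathbf{v}_3, \ldots, \mathbf{v}_n$ are 
linearly independent, and therefore form a basis for $\mathbb{R}^n$. It should be noted at 
this point that not all matrices have eigenvalues and eigenvectors which satisfy the 
conditions we’ve assumed here. At the end of the section and in the exercises, we 
will explore what happens when these conditions are violated. 

Let $\mathbf{x}^{(0)}$ be a nonzero element of $\mathbb{R}^n$. Since the eigenvectors of $A$ form a 
basis for $\mathbb{R}^n$, it follows that $\mathbf{x}^{(0)}$ can be written as a linear combination of $\mathbf{v}_1, \mathbf{v}_2, 
\mathbf{v}_3, \ldots, \mathbf{v}_n$; that is, there exist constants $\alpha_1, \alpha_2, \alpha_3, \ldots, \alpha_n$ such that 

$$\mathbf{x}^{(0)} = \alpha_1\mathbf{v}_1 + \alpha_2\mathbf{v}_2 + \alpha_3\mathbf{v}_3 + \cdots + \alpha_n\mathbf{v}_n.$$

Next, construct the sequence of vectors $\{\mathbf{x}^{(m)}\}$ according to the rule $\mathbf{x}^{(m)} = 
A\mathbf{x}^{(m-1)}$ for $m \ge 1$. By direct calculation we find 

$$\begin{aligned}
\mathbf{x}^{(1)} = A\mathbf{x}^{(0)} &= \alpha_1(A\mathbf{v}_1) + \alpha_2(A\mathbf{v}_2) + \alpha_3(A\mathbf{v}_3) + \cdots + \alpha_n(A\mathbf{v}_n) \\
&= \alpha_1(\lambda_1\mathbf{v}_1) + \alpha_2(\lambda_2\mathbf{v}_2) + \alpha_3(\lambda_3\mathbf{v}_3) + \cdots + \alpha_n(\lambda_n\mathbf{v}_n); \\
\mathbf{x}^{(2)} = A\mathbf{x}^{(1)} = A^2\mathbf{x}^{(0)} &= \alpha_1(A^2\mathbf{v}_1) + \alpha_2(A^2\mathbf{v}_2) + \alpha_3(A^2\mathbf{v}_3) + \cdots + \alpha_n(A^2\mathbf{v}_n) \\
&= \alpha_1(\lambda_1^2\mathbf{v}_1) + \alpha_2(\lambda_2^2\mathbf{v}_2) + \alpha_3(\lambda_3^2\mathbf{v}_3) + \cdots + \alpha_n(\lambda_n^2\mathbf{v}_n)
\end{aligned}$$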



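and, in general,

$$\begin{aligned}
\mathbf{x}^{(m)} &= A\mathbf{x}^{(m-1)} = A^m \mathbf{x}^{(0)} \\
&= \alpha_1 (A^m \mathbf{v}_1) + \alpha_2 (A^m \mathbf{v}_2) + \alpha_3 (A^m \mathbf{v}_3) + \cdots + \alpha_n (A^m \mathbf{v}_n) \\
&= \alpha_1 (\lambda_1^m \mathbf{v}_1) + \alpha_2 (\lambda_2^m \mathbf{v}_2) + \alpha_3 (\lambda_3^m \mathbf{v}_3) + \cdots + \alpha_n (\lambda_n^m \mathbf{v}_n).
\end{aligned}$$

In deriving these expressions we have made repeated use of the relation
$A\mathbf{v}_j = \lambda_j \mathbf{v}_j$, which follows from the fact that $\mathbf{v}_j$ is an eigenvector associated with the
eigenvalue $\lambda_j$.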


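Factoring $\lambda_1^m$ from the right-hand side of the equation for $\mathbf{x}^{(m)}$ gives

$$\mathbf{x}^{(m)} = \lambda_1^m \left[ \alpha_1 \mathbf{v}_1 + \alpha_2 \left(\frac{\lambda_2}{\lambda_1}\right)^m \mathbf{v}_2 + \alpha_3 \left(\frac{\lambda_3}{\lambda_1}\right)^m \mathbf{v}_3 + \cdots + \alpha_n \left(\frac{\lambda_n}{\lambda_1}\right)^m \mathbf{v}_n \right]. \tag{1}$$

By assumption, $|\lambda_j/\lambda_1| < 1$ for each $j \ge 2$, so $|\lambda_j/\lambda_1|^m \to 0$ as $m \to \infty$. It therefore
follows that

$$\lim_{m \to \infty} \frac{\mathbf{x}^{(m)}}{\lambda_1^m} = \alpha_1 \mathbf{v}_1.$$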


Since any nonzero constant times an eigenvector is still an eigenvector associated
with the same eigenvalue, we see that the scaled sequence $\{\mathbf{x}^{(m)}/\lambda_1^m\}$ converges to
an eigenvector associated with the dominant eigenvalue provided $\alpha_1 \ne 0$. Furthermore,
convergence toward the eigenvector is linear with asymptotic error constant
$|\lambda_2/\lambda_1|$.

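An approximation for the dominant eigenvalue of $A$ can be obtained from the
sequence $\{\mathbf{x}^{(m)}\}$ as follows. Let $i$ be an index for which $x_i^{(m-1)} \ne 0$, and consider
the ratio of the $i$th element from the vector $\mathbf{x}^{(m)}$ to the $i$th element from $\mathbf{x}^{(m-1)}$.
By equation (1),

$$\frac{x_i^{(m)}}{x_i^{(m-1)}} = \frac{\lambda_1^m \alpha_1 v_{1,i} \left[ 1 + O\left( (\lambda_2/\lambda_1)^m \right) \right]}{\lambda_1^{m-1} \alpha_1 v_{1,i} \left[ 1 + O\left( (\lambda_2/\lambda_1)^{m-1} \right) \right]} = \lambda_1 \left[ 1 + O\left( (\lambda_2/\lambda_1)^{m-1} \right) \right],$$

provided $v_{1,i} \ne 0$, where $v_{1,i}$ denotes the $i$th element from the vector $\mathbf{v}_1$. Hence, the
ratio $x_i^{(m)}/x_i^{(m-1)}$ converges toward the dominant eigenvalue, and the convergence
is linear with asymptotic rate constant $|\lambda_2/\lambda_1|$.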

To avoid overflow and underflow problems when calculating the sequence
$\{\mathbf{x}^{(m)}\}$ (note that $\lambda_1^m \to \pm\infty$ as $m \to \infty$ when $|\lambda_1| > 1$, whereas $\lambda_1^m \to 0$
when $|\lambda_1| < 1$), it is common practice to scale the vectors $\mathbf{x}^{(m)}$ so that they are all
of unit length. Here, we will use the $l_\infty$-norm to measure vector length. Thus, in a
practical implementation of the power method, the vector $\mathbf{x}^{(m)}$ would be computed
in two steps: First multiply the previous vector by the matrix $A$ and then scale the
resulting vector to unit length.

To simplify the notation, let's introduce the vector $\mathbf{y}^{(m)}$ to denote the result
of multiplying by the matrix $A$; that is, $\mathbf{y}^{(m)} = A\mathbf{x}^{(m-1)}$. $\mathbf{x}^{(m)}$ is then calculated
by the formula

$$\mathbf{x}^{(m)} = \frac{\mathbf{y}^{(m)}}{y^{(m)}_{p_m}},$$

where $p_m$ is an integer chosen so that $|y^{(m)}_{p_m}| = \|\mathbf{y}^{(m)}\|_\infty$. Note that $p_m$ is an index
into the vector $\mathbf{y}^{(m)}$. Whenever there is more than one possible choice for the
index $p_m$, we will adopt the convention of always selecting the smallest value. The
vector $\mathbf{x}^{(m)}$ now converges specifically to that multiple of $\mathbf{v}_1$ which has unit length
measured in the infinity norm. As for the eigenvalue, since $\mathbf{x}^{(m-1)}$ is approximately
an eigenvector associated with $\lambda_1$, $\mathbf{y}^{(m)} = A\mathbf{x}^{(m-1)} \approx \lambda_1 \mathbf{x}^{(m-1)}$. By construction
$x^{(m-1)}_{p_{m-1}} = 1$, so it follows that $y^{(m)}_{p_{m-1}}$ converges to $\lambda_1$.

The power method is an iterative scheme, so a convergence tolerance must be
specified and a stopping condition implemented. Three possibilities for the stopping
condition immediately come to mind. Iteration could be terminated when

$$|\lambda^{(m)} - \lambda^{(m-1)}| < TOL; \text{ when}$$
$$\|\mathbf{x}^{(m)} - \mathbf{x}^{(m-1)}\|_\infty < TOL; \text{ or when}$$
$$\|A\mathbf{x}^{(m)} - \lambda^{(m)}\mathbf{x}^{(m)}\|_\infty < TOL,$$

where $TOL$ denotes the specified convergence tolerance and $\lambda^{(m)}$ is used to denote
the approximation to the eigenvalue during the $m$th iteration. These conditions
represent checking for convergence of the eigenvalue, checking for convergence of
the eigenvector and checking for convergence of the residual, respectively.

Checking for convergence of the residual most accurately reflects the underlying
mathematical problem, as it measures how closely the eigenpair satisfies
the eigenvalue equation, but using this convergence condition can be problematic.
Knowing that eigenvalues are the roots of a polynomial, it should not be surprising
that eigenvalues can be very ill conditioned. As with linear systems, where it was
found that $A\mathbf{x}^* - \mathbf{b}$ can be a poor estimator of the error in $\mathbf{x}^*$ when $A$ is ill
conditioned, the residual $A\mathbf{x}^{(m)} - \lambda^{(m)}\mathbf{x}^{(m)}$ can be a poor estimator of the error in $\lambda^{(m)}$
and $\mathbf{x}^{(m)}$ when the eigenvalue $\lambda$ is ill conditioned. Consult Golub and van Loan [1]
or Ueberhuber [2] for details. Checking for convergence of the eigenvalue is the
least computationally expensive option, but it is possible for the correct eigenvalue
to be determined while $\mathbf{x}^{(m)}$ is still far from the true eigenvector (see Exercise 9).
For these reasons, we choose to implement the stopping condition on convergence
of the eigenvector.


EXAMPLE 4.2 A Demonstration of the Power Method


Consider the matrix

$$A = \begin{bmatrix} -2 & -2 & 3 \\ -10 & -1 & 6 \\ 10 & -2 & -9 \end{bmatrix},$$

whose eigenvalues are $\lambda_1 = -12$, $\lambda_2 = -3$, and $\lambda_3 = 3$. Let's start with the
vector $\mathbf{x}^{(0)} = [\, 1 \;\; 0 \;\; 0 \,]^T$, which already has an infinity norm of 1. Since the
first element in $\mathbf{x}^{(0)}$ is the only element that has an absolute value of one, we set
$p_0 = 1$.
For the first iteration of the power method, we compute

$$\mathbf{y}^{(1)} = A\mathbf{x}^{(0)} = [\, -2 \;\; -10 \;\; 10 \,]^T,$$

from which we obtain our first estimate for the dominant eigenvalue: $\lambda^{(1)} = y^{(1)}_{p_0} =
y^{(1)}_1 = -2$. Note that the infinity norm of the vector $\mathbf{y}^{(1)}$ is 10. Sticking with our
convention of selecting the smallest index for which the magnitude of the vector
element is equal to the infinity norm of the vector, we take $p_1 = 2$. Therefore, for
the second iteration, we have

$$\mathbf{x}^{(1)} = \frac{\mathbf{y}^{(1)}}{y^{(1)}_2} = [\, 1/5 \;\; 1 \;\; -1 \,]^T.$$

The calculations for the second iteration produce the results

$$\mathbf{y}^{(2)} = A\mathbf{x}^{(1)} = [\, -27/5 \;\; -9 \;\; 9 \,]^T,$$
$$\lambda^{(2)} = y^{(2)}_{p_1} = y^{(2)}_2 = -9,$$
$$p_2 = 2$$

and

$$\mathbf{x}^{(2)} = \frac{\mathbf{y}^{(2)}}{y^{(2)}_2} = [\, 3/5 \;\; 1 \;\; -1 \,]^T.$$

The third iteration then produces

$$\mathbf{y}^{(3)} = A\mathbf{x}^{(2)} = [\, -31/5 \;\; -13 \;\; 13 \,]^T,$$
$$\lambda^{(3)} = y^{(3)}_{p_2} = y^{(3)}_2 = -13,$$
$$p_3 = 2$$

and

$$\mathbf{x}^{(3)} = \frac{\mathbf{y}^{(3)}}{y^{(3)}_2} = [\, 31/65 \;\; 1 \;\; -1 \,]^T.$$

The following table displays the output from the 11 iterations of the power
method needed for the eigenvector to converge to within a tolerance of $5 \times 10^{-6}$.
The final estimates are

$$\lambda_1 \approx -12.000014 \quad \text{and} \quad \mathbf{v}_1 \approx [\, 0.500000 \;\; 1.000000 \;\; -1.000000 \,]^T.$$

The eigenvalue estimate is in error by roughly 0.0001%, while the eigenvector is
correct to the digits shown.

The values in the column headed "Convergence" were computed according to
the formula

$$\frac{|\lambda^{(j)} - \lambda^{(j-1)}|}{|\lambda^{(j-1)} - \lambda^{(j-2)}|}.$$

This quantity is an estimate for the asymptotic rate of linear convergence of the
sequence $\{\lambda^{(j)}\}$ toward the value $\lambda_1 = -12$. Note the values in this column approach
the value predicted by theory: $|\lambda_2/\lambda_1| = 3/12 = 0.25$.
 j   x^(j)T                                  lambda^(j)     Convergence
 0   [ 1.000000  1.000000 x 0  0.000000 ]
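
The iteration traced in this example can be sketched in a few lines of Python. The block below is only an illustration under stated assumptions, not the text's own code: the function name `power_method`, the helper `pivot`, and the small relative slack used to break near-ties in the pivot search are all hypothetical choices. It scales with the infinity norm, takes the eigenvalue estimate from the component that was scaled to 1 on the previous step, and stops on convergence of the eigenvector.

```python
import numpy as np


def power_method(A, x0, tol=5e-6, max_iter=100):
    """Power method with infinity-norm scaling (general matrices).

    Returns (eigenvalue estimate, eigenvector estimate, iterations used).
    """
    A = np.asarray(A, dtype=float)
    x = np.asarray(x0, dtype=float)

    def pivot(v):
        # Smallest index whose magnitude equals the infinity norm. The tiny
        # relative slack (an assumption of this sketch) keeps roundoff from
        # breaking an exact tie in favor of a later index.
        big = np.abs(v).max()
        return int(np.flatnonzero(np.abs(v) >= big * (1.0 - 1e-12))[0])

    p = pivot(x)
    x = x / x[p]                       # scale so that x[p] = 1
    for m in range(1, max_iter + 1):
        y = A @ x                      # y^(m) = A x^(m-1)
        lam = y[p]                     # x[p] = 1, so y[p] estimates lambda_1
        p = pivot(y)
        x_new = y / y[p]               # unit length in the infinity norm
        converged = np.abs(x_new - x).max() < tol
        x = x_new
        if converged:
            return lam, x, m
    raise RuntimeError("power method did not converge")


# Matrix and starting vector from Example 4.2.
A = np.array([[-2.0, -2.0, 3.0],
              [-10.0, -1.0, 6.0],
              [10.0, -2.0, -9.0]])
lam, x, iterations = power_method(A, [1.0, 0.0, 0.0], tol=5e-6)
print(iterations, lam, x)
```

Stopping on the eigenvector rather than on the eigenvalue follows the choice made in the text; swapping in one of the other two stopping conditions only changes the line that sets `converged`.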


Variation for Symmetric Matrices 


When the matrix $A$ is symmetric, a slight modification to the power method
provides more rapid convergence: convergence is still linear, but the asymptotic rate
constant is smaller. The modifications to the basic algorithm consist of using a
different norm to scale the vectors $\mathbf{x}^{(m)}$ and using a different formula to compute
the eigenvalue estimate. These changes are based on the following theorem, which
is discussed in most textbooks on linear algebra (see, for example, Anton and
Rorres [3], Lay [4], Leon [5], or Shifrin and Adams [6]). An elementary proof can be
found in Anton and Rorres [3].


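Theorem. If $A$ is an $n \times n$ symmetric matrix, then there exists a set of
$n$ eigenvectors $\mathbf{v}_1, \mathbf{v}_2, \mathbf{v}_3, \ldots, \mathbf{v}_n$ that are orthogonal with respect to the
standard inner product on $\mathbb{R}^n$; that is, whenever $i \ne j$,
$$\mathbf{v}_i^T \mathbf{v}_j = 0.$$


Recall that for arbitrary vectors $\mathbf{x}, \mathbf{y} \in \mathbb{R}^n$, the standard inner product is the
scalar quantity $\mathbf{x}^T\mathbf{y}$ (or $\mathbf{y}^T\mathbf{x}$), and associated with this inner product is the norm
$\sqrt{\mathbf{x}^T\mathbf{x}}$. When written in component form,

$$\sqrt{\mathbf{x}^T\mathbf{x}} = \left( \sum_{i=1}^{n} x_i^2 \right)^{1/2},$$

we can readily see that this is just the $l_2$, or Euclidean, norm.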

To exploit the orthogonality of the eigenvectors of a symmetric matrix within
the power method, we will measure vector length and scale the vectors $\mathbf{x}^{(m)}$ to unit
length using the Euclidean norm. Furthermore, we will compute an estimate for the
dominant eigenvalue using the standard inner product as follows. Premultiplying
both sides of the relation $\mathbf{y}^{(m)} \approx \lambda_1 \mathbf{x}^{(m-1)}$ by $\mathbf{x}^{(m-1)T}$ yields
$\mathbf{x}^{(m-1)T}\mathbf{y}^{(m)} \approx \lambda_1 \mathbf{x}^{(m-1)T}\mathbf{x}^{(m-1)} = \lambda_1$, since $\mathbf{x}^{(m-1)T}\mathbf{x}^{(m-1)} = 1$ by construction. Putting these
changes together, we arrive at the power method for symmetric matrices:

Let $\mathbf{x}^{(0)}$ be a nonzero element of $\mathbb{R}^n$ with $\mathbf{x}^{(0)T}\mathbf{x}^{(0)} = 1$. For $m =
1, 2, 3, \ldots$, calculate

$$\mathbf{y}^{(m)} = A\mathbf{x}^{(m-1)},$$
$$\lambda^{(m)} = \mathbf{x}^{(m-1)T}\mathbf{y}^{(m)}, \text{ and}$$
$$\mathbf{x}^{(m)} = \mathbf{y}^{(m)} / \sqrt{\mathbf{y}^{(m)T}\mathbf{y}^{(m)}}.$$

Then $\lambda^{(m)} \to \lambda_1$ and $\mathbf{x}^{(m)}$ converges to an eigenvector associated with $\lambda_1$ that has
unit length in the Euclidean norm.

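What about the convergence rate for this version of the power method? Using
equation (1) and the orthogonality of the eigenvectors, it follows that

$$\begin{aligned}
\mathbf{x}^{(m-1)} &= \frac{\lambda_1^{m-1} \left[ \alpha_1 \mathbf{v}_1 + \sum_{i=2}^{n} \alpha_i \left( \frac{\lambda_i}{\lambda_1} \right)^{m-1} \mathbf{v}_i \right]}{\sqrt{\lambda_1^{2m-2} \left[ \alpha_1 \mathbf{v}_1 + \sum_{i=2}^{n} \alpha_i \left( \frac{\lambda_i}{\lambda_1} \right)^{m-1} \mathbf{v}_i \right]^T \left[ \alpha_1 \mathbf{v}_1 + \sum_{i=2}^{n} \alpha_i \left( \frac{\lambda_i}{\lambda_1} \right)^{m-1} \mathbf{v}_i \right]}} \\
&= \frac{\alpha_1 \mathbf{v}_1 + \sum_{i=2}^{n} \alpha_i (\lambda_i/\lambda_1)^{m-1} \mathbf{v}_i}{\sqrt{\alpha_1^2 \mathbf{v}_1^T\mathbf{v}_1 + \sum_{i=2}^{n} \alpha_i^2 (\lambda_i/\lambda_1)^{2m-2} \mathbf{v}_i^T\mathbf{v}_i}}.
\end{aligned}$$

Then

$$\begin{aligned}
\mathbf{x}^{(m-1)T}\mathbf{y}^{(m)} &= \mathbf{x}^{(m-1)T} A \mathbf{x}^{(m-1)} \\
&= \frac{\lambda_1 \left[ \alpha_1^2 \mathbf{v}_1^T\mathbf{v}_1 + \sum_{i=2}^{n} \alpha_i^2 (\lambda_i/\lambda_1)^{2m-1} \mathbf{v}_i^T\mathbf{v}_i \right]}{\left[ \alpha_1^2 \mathbf{v}_1^T\mathbf{v}_1 + \sum_{i=2}^{n} \alpha_i^2 (\lambda_i/\lambda_1)^{2m-2} \mathbf{v}_i^T\mathbf{v}_i \right]} \\
&= \lambda_1 \left\{ 1 + O\left( (\lambda_2/\lambda_1)^{2(m-1)} \right) \right\}
\end{aligned}$$

and $\mathbf{x}^{(m-1)T}\mathbf{y}^{(m)} \to \lambda_1$ linearly with asymptotic rate constant $|\lambda_2/\lambda_1|^2$.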

While the eigenvalues of a general matrix can be very ill conditioned, the
eigenvalues of symmetric matrices are well conditioned. In particular, we have the
following remarkable result.


Theorem. Let $A$ be an $n \times n$ symmetric matrix, let $\hat{\lambda}$ be a real number, let
$\hat{\mathbf{x}}$ be an $n$-vector with $\|\hat{\mathbf{x}}\|_2 = 1$ and define $\mathbf{r} = A\hat{\mathbf{x}} - \hat{\lambda}\hat{\mathbf{x}}$. Then $A$ has an
eigenvalue $\lambda$ with
$$|\lambda - \hat{\lambda}| \le \|\mathbf{r}\|_2.$$


Thus, for a symmetric matrix, the $l_2$-norm of the residual vector $\mathbf{r}$ provides an
upper bound on the error in the eigenvalue estimate $\hat{\lambda}$. A proof of this theorem
will be developed in Exercise 21. Unfortunately, even if $\|\mathbf{r}\|_2$ is small, the error in
the eigenvector estimate $\hat{\mathbf{x}}$ may still be large (see Exercise 11). We will therefore
continue to implement a stopping condition based on convergence of the eigenvector
sequence.
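
As a quick numerical illustration of the residual bound, the sketch below evaluates both sides of the inequality for a small symmetric matrix. The 2 x 2 matrix, the trial vector, and the names `A`, `x_hat`, `lam_hat`, `bound`, and `err` are hypothetical choices made for this illustration only; `numpy.linalg.eigvalsh` supplies the reference eigenvalues.

```python
import numpy as np

# Hypothetical symmetric test matrix with eigenvalues 1 and 3.
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

x_hat = np.array([0.6, 0.8])            # trial vector with unit Euclidean norm
lam_hat = x_hat @ A @ x_hat             # trial eigenvalue (here 2.96)

r = A @ x_hat - lam_hat * x_hat         # residual vector
bound = np.linalg.norm(r)               # ||r||_2

# Distance from lam_hat to the nearest true eigenvalue of A.
err = np.abs(np.linalg.eigvalsh(A) - lam_hat).min()

print(err, bound)                       # err should not exceed bound
```

Here the eigenvalue estimate is within 0.04 of the eigenvalue 3 while the residual norm is 0.28, so the bound holds with room to spare; the bound says nothing, however, about how close `x_hat` is to an eigenvector.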
With this variation of the power method for symmetric matrices, there is
one glitch which can arise when checking for the convergence of the eigenvector
sequence. When $\lambda_1 < 0$, the sequence $\{\mathbf{x}^{(m)}\}$ won't converge to a multiple of $\mathbf{v}_1$,
but will alternate between a multiple of $\mathbf{v}_1$ and its opposite. To circumvent this
"convergence problem," we can compare the norm of $\mathbf{x}^{(m)} - \mathrm{sgn}(\lambda_1)\mathbf{x}^{(m-1)}$ to the
convergence tolerance, rather than the norm of $\mathbf{x}^{(m)} - \mathbf{x}^{(m-1)}$.


EXAMPLE 4.3 Demonstration of Power Method for Symmetric Matrices

Consider the $4 \times 4$ symmetric matrix

$$A = \begin{bmatrix} 5.5 & -2.5 & -2.5 & -1.5 \\ -2.5 & 5.5 & 1.5 & 2.5 \\ -2.5 & 1.5 & 5.5 & 2.5 \\ -1.5 & 2.5 & 2.5 & 5.5 \end{bmatrix},$$

whose eigenvalues are $\lambda_1 = 12$, $\lambda_2 = 4$, $\lambda_3 = 4$, and $\lambda_4 = 2$. The eigenvector
associated with $\lambda_1$ that has unit Euclidean norm is $\mathbf{v}_1 = [\, -1/2 \;\; 1/2 \;\; 1/2 \;\; 1/2 \,]^T$.

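We will start the iteration with the vector $\mathbf{x}^{(0)} = [\, 0.5 \;\; 0.5 \;\; 0.5 \;\; 0.5 \,]^T$.
Note that

$$\mathbf{x}^{(0)T}\mathbf{x}^{(0)} = (0.5)(0.5) + (0.5)(0.5) + (0.5)(0.5) + (0.5)(0.5) = 1,$$

so this vector is already properly scaled in the Euclidean norm. For $m = 1$, we
calculate

$$\mathbf{y}^{(1)} = A\mathbf{x}^{(0)} = [\, -0.5 \;\; 3.5 \;\; 3.5 \;\; 4.5 \,]^T,$$
$$\lambda^{(1)} = \mathbf{x}^{(0)T}\mathbf{y}^{(1)} = (0.5)(-0.5) + (0.5)(3.5) + (0.5)(3.5) + (0.5)(4.5) = 5.5$$

and

$$\mathbf{x}^{(1)} = \frac{\mathbf{y}^{(1)}}{\sqrt{\mathbf{y}^{(1)T}\mathbf{y}^{(1)}}} = \frac{[\, -0.5 \;\; 3.5 \;\; 3.5 \;\; 4.5 \,]^T}{\sqrt{(-0.5)(-0.5) + (3.5)(3.5) + (3.5)(3.5) + (4.5)(4.5)}} = \frac{[\, -0.5 \;\; 3.5 \;\; 3.5 \;\; 4.5 \,]^T}{3\sqrt{5}}.$$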


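Continuing on to the second iteration, we find

$$\mathbf{y}^{(2)} = A\mathbf{x}^{(1)} = \frac{1}{3\sqrt{5}} A \, [\, -0.5 \;\; 3.5 \;\; 3.5 \;\; 4.5 \,]^T = \frac{1}{3\sqrt{5}} [\, -27 \;\; 37 \;\; 37 \;\; 43 \,]^T,$$

$$\lambda^{(2)} = \mathbf{x}^{(1)T}\mathbf{y}^{(2)} = \frac{(-0.5)(-27) + (3.5)(37) + (3.5)(37) + (4.5)(43)}{(3\sqrt{5})(3\sqrt{5})} = \frac{466}{45} = 10.355556$$

and

$$\mathbf{x}^{(2)} = \frac{\mathbf{y}^{(2)}}{\sqrt{\mathbf{y}^{(2)T}\mathbf{y}^{(2)}}} = \frac{[\, -27 \;\; 37 \;\; 37 \;\; 43 \,]^T}{\sqrt{(-27)(-27) + (37)(37) + (37)(37) + (43)(43)}} = \frac{[\, -27 \;\; 37 \;\; 37 \;\; 43 \,]^T}{2\sqrt{1329}}.$$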


The table below displays the results of these first two iterations, together
with the next eight. Calculations were terminated when the Euclidean norm of the
difference between successive approximations to the eigenvector fell below $5 \times 10^{-5}$.
The eigenvalue estimate is correct to the digits shown, while the eigenvector is
correct to four digits in the first and fourth components and to all digits shown in
the second and third components. The estimate for the asymptotic rate of linear
convergence given in the last column of the table is in excellent agreement with the
theoretical value: $|\lambda_2/\lambda_1|^2 = (4/12)^2 = 1/9$.


 j   x^(j)T                                         lambda^(j)   Convergence
 0   [  0.500000  0.500000  0.500000  0.500000 ]
 1   [ -0.074536  0.521749  0.521749  0.670820 ]     5.500000
 2   [ -0.370315  0.507469  0.507469  0.589761 ]    10.355556
 3   [ -0.460013  0.501622  0.501622  0.533985 ]    11.799850    0.297452
 4   [ -0.487194  0.500309  0.500309  0.511882 ]    11.977899    0.123278
 5   [ -0.495812  0.500056  0.500056  0.504042 ]    11.997556    0.110403
 6   [ -0.498617  0.500010  0.500010  0.501360 ]    11.999729    0.110531
 7   [ -0.499541  0.500002  0.500002  0.500455 ]    11.999970    0.110923
 8   [ -0.499847  0.500000  0.500000  0.500152 ]    11.999997    0.111059
 9   [ -0.499949  0.500000  0.500000  0.500051 ]    12.000000    0.112098
10   [ -0.499983  0.500000  0.500000  0.500017 ]    12.000000    0.111108
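
The symmetric variation can be sketched as follows. This is an illustrative Python version under stated assumptions rather than a reference implementation: the function name `symmetric_power_method` is hypothetical, and the sign of the current eigenvalue estimate is used in place of $\mathrm{sgn}(\lambda_1)$ in the convergence test, since $\lambda_1$ itself is not known in advance.

```python
import numpy as np


def symmetric_power_method(A, x0, tol=5e-5, max_iter=500):
    """Power method for symmetric matrices (Euclidean scaling).

    Returns (eigenvalue estimate, eigenvector estimate, iterations used).
    """
    A = np.asarray(A, dtype=float)
    x = np.asarray(x0, dtype=float)
    x = x / np.linalg.norm(x)              # enforce x^T x = 1
    for m in range(1, max_iter + 1):
        y = A @ x                          # y^(m) = A x^(m-1)
        lam = x @ y                        # lambda^(m) = x^(m-1)^T y^(m)
        x_new = y / np.linalg.norm(y)      # unit Euclidean length
        # Compare against sgn(lambda) * x so that a negative dominant
        # eigenvalue (alternating signs) still registers as converged.
        # Assumption of this sketch: sgn(lam) stands in for sgn(lambda_1).
        converged = np.linalg.norm(x_new - np.sign(lam) * x) < tol
        x = x_new
        if converged:
            return lam, x, m
    raise RuntimeError("power method did not converge")


# Matrix and starting vector from Example 4.3.
A = np.array([[5.5, -2.5, -2.5, -1.5],
              [-2.5, 5.5, 1.5, 2.5],
              [-2.5, 1.5, 5.5, 2.5],
              [-1.5, 2.5, 2.5, 5.5]])
lam, x, iterations = symmetric_power_method(A, [0.5, 0.5, 0.5, 0.5], tol=5e-5)
print(iterations, lam, x)
```

Note that the eigenvalue estimate converges faster than the eigenvector: by the time the eigenvector test is satisfied, the eigenvalue is already accurate to many more digits than the vector.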


Application Problem 1: Age Demographics of a Female Population 


The Leslie model describes the dynamics of the female portion of a population.
In this model, females are divided into age classes of equal duration. Suppose
there are a total of $n$ distinct age classes. The birth and death processes which
control the future evolution of the population are described by the parameters $a_i$
($i = 1, 2, 3, \ldots, n$) and $b_i$ ($i = 1, 2, 3, \ldots, n-1$), which measure the average number
of daughters born to each female in the $i$th age class and the fraction of females in
the $i$th age class who survive into the $(i+1)$st age class, respectively.

If $\mathbf{x}^{(k)}$ denotes the distribution of the population among the age classes at
some time $t = t_k$, then the distribution at time $t = t_{k+1}$, denoted by the vector
$\mathbf{x}^{(k+1)}$, is given by

$$\mathbf{x}^{(k+1)} = L\mathbf{x}^{(k)},$$

where the Leslie matrix, $L$, has the form

$$L = \begin{bmatrix}
a_1 & a_2 & a_3 & \cdots & \cdots & a_n \\
b_1 & 0 & 0 & \cdots & \cdots & 0 \\
0 & b_2 & 0 & \cdots & \cdots & 0 \\
\vdots & & \ddots & \ddots & & \vdots \\
0 & \cdots & \cdots & 0 & b_{n-1} & 0
\end{bmatrix}.$$

It is required that the elapsed time between $t = t_k$ and $t = t_{k+1}$ be equal to the
width of the age classes. The dominant eigenvalue of the Leslie matrix indicates the
growth rate of the female population. The coordinates of the associated eigenvector
indicate the steady-state distribution of the population across the age brackets.

The following parameters for a sheep population were obtained by Caughley [7]
from data collected by Hickey [8,9].


Age (years)    a_i      b_i
     0        0.000    1.000
     1        0.045    0.845
     2        0.391    0.824
     3        0.472    0.795
     4        0.484    0.755
     5        0.546    0.699
     6        0.543    0.626
     7        0.502    0.532
     8        0.468    0.418
     9        0.459    0.289
    10        0.433    0.162
    11        0.421


Using the power method on the 12 x 12 Leslie matrix corresponding to this data, 
with x =[1 1 2 11112111 1 1)" and TOL =5x 10%, 
produces 

$$\lambda_1 \approx 1.08999$$

and

$$\mathbf{x} \approx [\, 1.00000 \;\; 0.91744 \;\; 0.71123 \;\; 0.53767 \;\; 0.39216 \;\; 0.27164 \;\; 0.17420 \;\; 0.10004 \;\; 0.04883 \;\; 0.01872 \;\; 0.00496 \;\; 0.00074 \,]^T.$$

Twenty-six iterations were required to achieve convergence. The eigenvalue
indicates that the population in each age bracket will increase by 8.999% each year.
Rescaling the eigenvector so that the sum of the components is one, we find that
the model predicts 23.9% of the female population in the 0-1 age class, 22.0% in
the 1-2 age class, 17.0% in the 2-3 age class, and so on.
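
A short Python sketch of this calculation is given below. It assumes the sheep parameters tabulated above, a starting vector of all ones, and a tolerance of $10^{-10}$ chosen for this illustration; the function names `leslie_matrix` and `dominant_pair` are hypothetical. Because the iteration count depends on the tolerance, the sketch does not attempt to reproduce the count quoted in the text.

```python
import numpy as np


def leslie_matrix(a, b):
    """Assemble a Leslie matrix: fecundities a on the first row,
    survival fractions b on the subdiagonal."""
    n = len(a)
    L = np.zeros((n, n))
    L[0, :] = a
    L[np.arange(1, n), np.arange(n - 1)] = b
    return L


def dominant_pair(M, tol=1e-10, max_iter=5000):
    """Power method with infinity-norm scaling, started from a vector of ones."""
    x = np.ones(M.shape[0])
    p = 0                                  # x[p] = 1 on entry
    for _ in range(max_iter):
        y = M @ x
        lam = y[p]                         # eigenvalue estimate
        p = int(np.argmax(np.abs(y)))      # first index of largest magnitude
        x_new = y / y[p]
        if np.abs(x_new - x).max() < tol:
            return lam, x_new
        x = x_new
    raise RuntimeError("power method did not converge")


# Sheep population parameters from the table above.
a = [0.000, 0.045, 0.391, 0.472, 0.484, 0.546,
     0.543, 0.502, 0.468, 0.459, 0.433, 0.421]
b = [1.000, 0.845, 0.824, 0.795, 0.755, 0.699,
     0.626, 0.532, 0.418, 0.289, 0.162]

growth, x = dominant_pair(leslie_matrix(a, b))
shares = x / x.sum()                       # steady-state age distribution
print(growth, shares[:3])
```

Dividing the eigenvector by the sum of its components, as in the last step, turns it into the fraction of the population in each age class.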
Application Problem 2: Eigenvalues and Undirected Graphs


Geometrically, an undirected graph consists of a set of marked points, called
vertices, together with a set of lines which connect pairs of vertices, called edges. For
example, Figure 4.2 displays an undirected graph with seven vertices (labeled 1, 2,
3, ..., 7) and nine edges.


Figure 4.2 An undirected graph.


Two vertices that are connected by an edge are said to be adjacent. For
instance, vertices 1 and 2 are adjacent in Figure 4.2, but vertices 1 and 3 are not.
The overall adjacency structure of an undirected graph can be summarized in an
adjacency matrix. For an undirected graph with $n$ vertices, the adjacency matrix,
which we shall denote by $A$, is the $n \times n$ matrix whose elements are defined by

$$a_{ij} = \begin{cases} 1, & \text{vertex } i \text{ is adjacent to vertex } j \\ 0, & \text{otherwise.} \end{cases}$$


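The adjacency matrix for the undirected graph in Figure 4.2 is

$$A = \begin{bmatrix}
0 & 1 & 0 & 0 & 0 & 1 & 1 \\
1 & 0 & 1 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 1 & 1 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 1 & 1 & 0 & 1 & 0 \\
1 & 0 & 1 & 0 & 1 & 0 & 1 \\
1 & 0 & 0 & 0 & 0 & 1 & 0
\end{bmatrix}.$$

Applying the power method for symmetric matrices to $A$ produces the estimates

$$\lambda_1 \approx 2.86081$$

and

$$\mathbf{v}_1 \approx [\, 0.406691 \;\; 0.290865 \;\; 0.425420 \;\; 0.134554 \;\; 0.384933 \;\; 0.541244 \;\; 0.331352 \,]^T.$$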


What significance does the dominant eigenvalue of an adjacency matrix have?
Suppose we wish to assign a color to each vertex in an undirected graph in such a
way that adjacent vertices are assigned different colors. Such an assignment is called
a proper coloring of a graph and may be used by, say, a mapmaker to determine
how to color the geographic regions on a map so that regions that share a common
border receive different colors. The minimum number of colors that can be used in
a proper coloring of a graph is called the chromatic number and is denoted by $\chi$.
It can be shown that $\frac{n}{n - \lambda_1} \le \chi \le 1 + \lambda_1$, where $n$ is the number of vertices in the
graph and $\lambda_1$ is the dominant eigenvalue of the adjacency matrix. See Cvetkovic,
Doob, and Sachs [10] for a proof of the lower bound and van Lint and Wilson [11]
for a proof of the upper bound. Hence, for the graph in Figure 4.2, $1.69 \le \chi \le 3.86$,
or since the chromatic number must be an integer, $2 \le \chi \le 3$.

Next, suppose the vertices in an undirected graph represent cities and an edge
represents the existence of a direct traveling route between two cities. Geographers
have shown that the entries in an eigenvector associated with the dominant
eigenvalue of the adjacency matrix provide a measure of the accessibility of the cities
(Straffin [12]). Thus, since the sixth entry in $\mathbf{v}_1$ is the largest, vertex 6 represents
the most accessible city. Further, since the first, third, and fifth entries of $\mathbf{v}_1$ are
roughly equal, the cities represented by vertices 1, 3, and 5 are nearly equal in
terms of accessibility. Finally, as might have been expected, the city represented
by vertex 4 is the least accessible.
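
The calculations behind these conclusions can be sketched in Python as shown below. This is an illustration under assumptions, not the text's code: the edge list is read off the adjacency matrix of Figure 4.2 given above, the function name `symmetric_power` is hypothetical, and the tolerance and the starting vector (all components equal) are arbitrary choices.

```python
import math
import numpy as np


def symmetric_power(A, tol=1e-12, max_iter=10000):
    """Power method for symmetric matrices, started from a constant vector."""
    n = A.shape[0]
    x = np.ones(n) / math.sqrt(n)
    for _ in range(max_iter):
        y = A @ x
        lam = x @ y
        x_new = y / np.linalg.norm(y)
        if np.linalg.norm(x_new - x) < tol:
            return lam, x_new
        x = x_new
    raise RuntimeError("power method did not converge")


# Edges of the graph in Figure 4.2 (vertices numbered from 1).
edges = [(1, 2), (1, 6), (1, 7), (2, 3), (3, 5), (3, 6), (4, 5), (5, 6), (6, 7)]
n = 7
A = np.zeros((n, n))
for i, j in edges:
    A[i - 1, j - 1] = A[j - 1, i - 1] = 1.0     # undirected: symmetric entries

lam1, v1 = symmetric_power(A)

# Bounds on the chromatic number, rounded to the enclosed integers.
chi_min = math.ceil(n / (n - lam1))
chi_max = math.floor(1 + lam1)

most_accessible = int(np.argmax(v1)) + 1         # back to 1-based labels
least_accessible = int(np.argmin(v1)) + 1
print(lam1, (chi_min, chi_max), most_accessible, least_accessible)
```

Building the matrix from an edge list rather than typing it entry by entry makes the symmetry automatic and keeps the correspondence with the figure easy to check.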


Some Final Comments Regarding the Power Method 


Throughout this section, we have made several important assumptions. First, we
assumed that the eigenvalues of $A$ satisfy the relations $|\lambda_1| > |\lambda_2| \ge |\lambda_3| \ge \cdots \ge
|\lambda_n|$; that is, that $A$ possesses a unique dominant eigenvalue of multiplicity one.
Second, we assumed that the associated eigenvectors $\mathbf{v}_1, \mathbf{v}_2, \mathbf{v}_3, \ldots, \mathbf{v}_n$ are linearly
independent. Finally, we assumed that when the vector $\mathbf{x}^{(0)}$ is written as a linear
combination of the eigenvectors,

$$\mathbf{x}^{(0)} = \alpha_1 \mathbf{v}_1 + \alpha_2 \mathbf{v}_2 + \alpha_3 \mathbf{v}_3 + \cdots + \alpha_n \mathbf{v}_n,$$

the coefficient $\alpha_1$ is nonzero. We will now discuss what happens when these
assumptions are violated. Demonstrations of these points will be provided in the
exercises.

Let's start with the last of our assumptions. What happens if $\alpha_1 = 0$? This is
not a major problem. Eventually, the roundoff errors produced during the iterations
will generate a vector $\mathbf{x}^{(m)}$ for which $\alpha_1 \ne 0$. We can also reduce the likelihood
that $\alpha_1 = 0$ from the outset by using a random number generator to select the
components of $\mathbf{x}^{(0)}$.

If $A$ has a unique dominant eigenvalue of multiplicity one but does not possess
a set of $n$ linearly independent eigenvectors, the power method will generally still
converge to the dominant eigenvalue. In some cases, however, depending on the
choice of $\mathbf{x}^{(0)}$, the power method may converge to a non-dominant eigenvalue.
What if $A$ does not have a unique dominant eigenvalue; for example, $|\lambda_1| = |\lambda_2|$
but $\lambda_1 \ne \lambda_2$? In some instances, the eigenvalue sequence, $\lambda^{(m)}$, will converge to one
of the dominant eigenvalues. Generally, however, neither the eigenvalue sequence
nor the eigenvector sequence, $\mathbf{x}^{(m)}$, will converge.

Finally, what happens when $A$ has a unique dominant eigenvalue of multiplicity
greater than 1? Suppose that $\lambda_1 = \lambda_2 = \lambda_3 = \cdots = \lambda_r$. If this eigenvalue
has a set of $r$ linearly independent eigenvectors, then the eigenvalue sequence of
the power method will converge to $\lambda_1$. The eigenvector sequence will converge to
a linear combination of the $r$ linearly independent eigenvectors, with the specific
linear combination dependent on the choice of $\mathbf{x}^{(0)}$. On the other hand, if $\lambda_1$ does
not have a set of $r$ linearly independent eigenvectors, then the power method will
generally fail to converge. In some cases, the power method will still converge, but
the convergence will be extremely slow.

Clearly, the power method has its drawbacks. However, if $|\lambda_2/\lambda_1| \ll 1$, the
power method is an efficient method for obtaining the dominant eigenvalue and an
associated eigenvector. It is also important to note that to implement the power
method, we only need to be able to calculate the matrix-vector product $A\mathbf{x}^{(m)}$. It
is not actually necessary to store the matrix $A$ in an $n \times n$ array. Therefore, the
power method can be of interest for large, sparse matrices.
EXERCISES


In Exercises 1-7, a matrix $A$ and a vector $\mathbf{x}^{(0)}$ are given. Perform five iterations of the
appropriate version of the power method.


1. 


2. 


3 2 -2 - 
a=[-3 -1 g | ansx=[1 0 0] 


Ear -.0 

15 7 -7 = 
A=|-1 1 1 | amtx®= [1 0 0] 

13 7 -5 

1 04 -06 e 
A=|-04 1 04 | snd = [2 1 1] 


-0.6 0.4 1 


19 ~9 -6 n 
A=|] 25 -ll -9 | andx=[0 0 1] 
17-9 «=—4 
1 4 °5 i 
A=|4 -3 0 |andx%=[1 0 1] 
5 0 7 
fio -4 0 -4 
—-4 -5 90 T 
AS | “G5 and x) =[ 1/2 1/2 1/2 1/2 | 
ees ci eee) ae 
(1025 00 0 0 
0 025 0 1 0.25 0 
_|0 0 O 0 6.25 0 (0) Tv 
A=|g 92500 0 o | mex? =[1 91 0 1 oO] 
0 0.25 1 0 0.25 0 
£9 O O 0 0.25 1 
8. Since the sequences $\{\mathbf{x}^{(m)}\}$ and $\{\lambda^{(m)}\}$ converge linearly, convergence can be
accelerated by applying Aitken's $\Delta^2$-method. Discuss how you would incorporate
Aitken's $\Delta^2$-method into the power method. Consider both the generic form and
the variation for symmetric matrices.


9. For the following matrices, use the power method with a randomly selected


initial vector and a convergence tolerance of 5 x 107® to estimate the dominant 
eigenvalue and its associated eigenvector. How many iterations are needed for 
convergence of the eigenvector? Compare this with the number of iterations 
required for convergence of the eigenvalue. Had convergence of the eigenvalue 
been used for the stopping criterion, what would the error have been for the
eigenvector estimate?


42 -2 2 

ge he ey 1.36 0.48 0 
(a) A= 00 2 0 (b) A=| 048 1.64 Q 

a ee ee ae: 


10. In the example "Demonstration of Power Method for Symmetric Matrices," we
estimated the dominant eigenvalue and an associated eigenvector for the matrix

$$A = \begin{bmatrix} 5.5 & -2.5 & -2.5 & -1.5 \\ -2.5 & 5.5 & 1.5 & 2.5 \\ -2.5 & 1.5 & 5.5 & 2.5 \\ -1.5 & 2.5 & 2.5 & 5.5 \end{bmatrix}$$

using the variation of the power method designed for symmetric matrices. Using
the generic power method algorithm, recompute the dominant eigenvalue for
this matrix. Take $\mathbf{x}^{(0)} = [\, 1 \;\; 1 \;\; 1 \;\; 1 \,]^T$ as the starting vector and use a
convergence tolerance of $5 \times 10^{-6}$. Compare the performance of the generic
power method with that of the symmetric matrix variation.


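11. Consider the matrix

$$A = \begin{bmatrix} 1 & \epsilon \\ \epsilon & 1 \end{bmatrix}.$$

The eigenvalues of $A$ are $\lambda_1 = 1 + \epsilon$ and $\lambda_2 = 1 - \epsilon$. The corresponding
eigenvectors that have unit length in the $l_\infty$-norm are $\mathbf{v}_1 = [\, 1 \;\; 1 \,]^T$ and
$\mathbf{v}_2 = [\, 1 \;\; -1 \,]^T$. Let $\hat{\lambda} = 1$ and $\hat{\mathbf{x}} = [\, 1 \;\; 0 \,]^T$.

(a) Calculate the residual vector $\mathbf{r} = A\hat{\mathbf{x}} - \hat{\lambda}\hat{\mathbf{x}}$.

(b) Compare the norm of $\mathbf{r}$ with the norms of $\hat{\mathbf{x}} - \mathbf{v}_1$ and $\hat{\mathbf{x}} - \mathbf{v}_2$.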

12. Consider the matrix

$$A = \begin{bmatrix} 1 & 2 & 3 \\ 0 & 4 & 6 \\ 0 & 0 & 4.001 \end{bmatrix},$$

whose dominant eigenvalue is clearly 4.001. Perform 20 iterations of the power
method starting with a randomly selected vector $\mathbf{x}^{(0)}$. How does the $l_\infty$-norm
of the residual vector $\mathbf{r} = A\mathbf{x}^{(20)} - \lambda^{(20)}\mathbf{x}^{(20)}$ compare with the absolute error
in $\lambda^{(20)}$?


13. Each of the following matrices has a unique dominant eigenvalue of multiplicity
one but does not possess a complete set of linearly independent eigenvectors.
Use the power method to determine the dominant eigenvalue and an associated
eigenvector for each matrix.


ae Takata 
(a) A= i e CBS oe ob 0 Ag 
001 2 
14. Each of the following matrices has a unique dominant eigenvalue of multiplicity
greater than one, and the eigenvalue does possess a complete set of linearly
independent eigenvectors. Use the power method with several different randomly
selected initial vectors and observe that the eigenvalue sequence converges to the
same value each time but the eigenvector sequence converges to different vectors.
so og 
0 5 =O | (b) A=] _ oy . 
1 -l 

1 -1 8 3 


(a) A= 


15. Each of the following matrices has a unique dominant eigenvalue of multiplicity
greater than one, but the eigenvalue does not possess a complete set of linearly
independent eigenvectors. Apply the power method with a randomly selected
initial vector. Limit calculations to at most 20 iterations and comment on the
behavior of the eigenvalue and eigenvector sequences.


000 3 11 
2 (b) A= 


ooo 
OrrrH 
Nr ooo 


1 
0 
0 


16. The following matrices do not have a unique dominant eigenvalue. Apply the
power method with a randomly selected initial vector. Limit calculations to at
most 20 iterations and comment on the behavior of the eigenvalue and eigenvector
sequences.


01010 

Oe 0 0 10 
(a) A=|0 101 0 (b) A= i 0-6 

1.410) ie OF st Or 2 

0. 1.0 1 <0 


17. Suppose that a particular insect has a lifespan of five years and that the
parameters of the Leslie matrix for this insect are


Age Bracket a b; 


0-1 0.0 0.7 
1-2 0.0 0.9 
2-3 12 0.9 
3-4 2.3 0.6 
4-5 0.9 


Determine the annual growth rate of the female population and the steady-state 
distribution of the female population among the age brackets. 


18. For each of the undirected graphs in Figure 4.3, write out the corresponding
adjacency matrix and then determine the upper and lower bounds on the chromatic
number of the graph and rank the accessibility of the vertices.


Figure 4.3 Undirected graphs for Exercise 18.


19. In studying loggerhead sea turtles, Crouse, Crowder and Caswell ("A Stage-Based
Population Model for Loggerhead Sea Turtles and Implications for Conservation,"
Ecology, 68 (5), 1412-1423, 1987) developed a variation of a Leslie
model that involved the matrix

$$M = \begin{bmatrix}
0 & 0 & 0 & 0 & 127 & 4 & 80 \\
0.6747 & 0.7370 & 0 & 0 & 0 & 0 & 0 \\
0 & 0.0486 & 0.6610 & 0 & 0 & 0 & 0 \\
0 & 0 & 0.0147 & 0.6907 & 0 & 0 & 0 \\
0 & 0 & 0 & 0.0518 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0.8091 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0.8091 & 0.8089
\end{bmatrix}.$$

The dominant eigenvalue and its associated eigenvector for this matrix have the
same interpretation as for a Leslie matrix. Determine the annual growth rate of
this population and the steady-state distribution of the population among the
various classes.

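20. Let $A$ be an $n \times n$ symmetric matrix.

(a) Show that if $A$ is symmetric positive definite, then all of the eigenvalues of
$A$ are positive.

(b) Show that if all of the eigenvalues of $A$ are positive, then $A$ is symmetric
positive definite.

21. Let $A$ be an $n \times n$ symmetric matrix, let $\hat{\lambda}$ be a real number, let $\hat{\mathbf{x}}$ be an
$n$-vector with $\|\hat{\mathbf{x}}\|_2 = 1$ and define $\mathbf{r} = A\hat{\mathbf{x}} - \hat{\lambda}\hat{\mathbf{x}}$. We will establish that $A$ has an
eigenvalue $\lambda$ with

$$|\lambda - \hat{\lambda}| \le \|\mathbf{r}\|_2.$$

Since $A$ is symmetric, there exists a set of $n$ eigenvectors $\mathbf{v}_1, \mathbf{v}_2, \mathbf{v}_3, \ldots, \mathbf{v}_n$
which are orthogonal in the standard inner product on $\mathbb{R}^n$. We can therefore
write

$$\hat{\mathbf{x}} = \sum_{i=1}^{n} \beta_i \mathbf{v}_i$$

for some constants $\beta_1, \beta_2, \beta_3, \ldots, \beta_n$.

(a) Show that $1 = \|\hat{\mathbf{x}}\|_2^2 = \sum_{i=1}^{n} \beta_i^2 \|\mathbf{v}_i\|_2^2$.

(b) Show that $\mathbf{r} = \sum_{i=1}^{n} \beta_i (\lambda_i - \hat{\lambda}) \mathbf{v}_i$, where the $\lambda_i$ are the eigenvalues of $A$.

(c) Show that $\|\mathbf{r}\|_2^2 \ge \left( \min_{1 \le i \le n} |\lambda_i - \hat{\lambda}| \right)^2$.

(d) Use (c) to deduce that $A$ has an eigenvalue $\lambda$ with $|\lambda - \hat{\lambda}| \le \|\mathbf{r}\|_2$.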
4.2 THE INVERSE POWER METHOD


The power method is designed to approximate the dominant eigenvalue (the
eigenvalue that is largest in magnitude) of a matrix. There are many instances, however,
in which an eigenvalue other than the dominant one is needed. For example, the
buckling load of a beam, the fundamental vibrational frequency of a structure, and
the ground state of a quantum operator require the eigenvalue that is smallest in
magnitude. In a more general setting, we may have an estimate for an arbitrary
eigenvalue, obtained perhaps using the Gerschgorin Circle Theorem, and want to
determine a more accurate approximation. In this section, we present a technique,
the inverse power method, for addressing these problems.


Some Theory 


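To derive the inverse power method, we will need the following result, which relates
the eigenvalues of a matrix $A$ to a class of matrices which can be constructed from $A$.


Theorem. Let $A$ be an $n \times n$ matrix with eigenvalues $\lambda_1, \lambda_2, \lambda_3, \ldots, \lambda_n$ and
associated eigenvectors $\mathbf{v}_1, \mathbf{v}_2, \mathbf{v}_3, \ldots, \mathbf{v}_n$.

1. If $B = a_0 I + a_1 A + a_2 A^2 + \cdots + a_m A^m = p(A)$, where $p$ is the polynomial
$p(x) = a_0 + a_1 x + a_2 x^2 + \cdots + a_m x^m$, then the eigenvalues of $B$ are

$$p(\lambda_1), \; p(\lambda_2), \; p(\lambda_3), \; \ldots, \; p(\lambda_n)$$

with associated eigenvectors $\mathbf{v}_1, \mathbf{v}_2, \mathbf{v}_3, \ldots, \mathbf{v}_n$.

2. If $A$ is nonsingular, then $A^{-1}$ has eigenvalues

$$\frac{1}{\lambda_1}, \; \frac{1}{\lambda_2}, \; \frac{1}{\lambda_3}, \; \ldots, \; \frac{1}{\lambda_n}$$

with associated eigenvectors $\mathbf{v}_1, \mathbf{v}_2, \mathbf{v}_3, \ldots, \mathbf{v}_n$.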

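Proof. Let $A$ be an $n \times n$ matrix with eigenvalues $\lambda_1, \lambda_2, \lambda_3, \ldots, \lambda_n$ and
associated eigenvectors $\mathbf{v}_1, \mathbf{v}_2, \mathbf{v}_3, \ldots, \mathbf{v}_n$.

Part 1:

Note that for any positive integer $k$,

$$\begin{aligned}
A^k \mathbf{v}_i &= A^{k-1}(A\mathbf{v}_i) = \lambda_i A^{k-1}\mathbf{v}_i \\
&= \lambda_i A^{k-2}(A\mathbf{v}_i) = \lambda_i^2 A^{k-2}\mathbf{v}_i \\
&\;\;\vdots \\
&= \lambda_i^{k-1}(A\mathbf{v}_i) = \lambda_i^k \mathbf{v}_i.
\end{aligned}$$

Now, let $B = a_0 I + a_1 A + a_2 A^2 + \cdots + a_m A^m = p(A)$, where $p$ is the polynomial
$p(x) = a_0 + a_1 x + a_2 x^2 + \cdots + a_m x^m$. Then, for each $i = 1, 2, 3, \ldots, n$,

$$\begin{aligned}
B\mathbf{v}_i &= (a_0 I + a_1 A + a_2 A^2 + \cdots + a_m A^m)\mathbf{v}_i \\
&= a_0 \mathbf{v}_i + a_1 A\mathbf{v}_i + a_2 A^2\mathbf{v}_i + \cdots + a_m A^m\mathbf{v}_i \\
&= a_0 \mathbf{v}_i + a_1 \lambda_i \mathbf{v}_i + a_2 \lambda_i^2 \mathbf{v}_i + \cdots + a_m \lambda_i^m \mathbf{v}_i \\
&= (a_0 + a_1 \lambda_i + a_2 \lambda_i^2 + \cdots + a_m \lambda_i^m)\mathbf{v}_i \\
&= p(\lambda_i)\mathbf{v}_i.
\end{aligned}$$

Hence, the eigenvalues of $B$ are

$$p(\lambda_1), \; p(\lambda_2), \; p(\lambda_3), \; \ldots, \; p(\lambda_n)$$

with associated eigenvectors $\mathbf{v}_1, \mathbf{v}_2, \mathbf{v}_3, \ldots, \mathbf{v}_n$.


Part 2:

Suppose $A$ is nonsingular. Since $\mathbf{v}_i$ is an eigenvector associated with the
eigenvalue $\lambda_i$, it follows that

$$A\mathbf{v}_i = \lambda_i \mathbf{v}_i.$$

Premultiplying this equation by $(1/\lambda_i)A^{-1}$ yields

$$\frac{1}{\lambda_i} A^{-1}(A\mathbf{v}_i) = \frac{1}{\lambda_i} A^{-1}(\lambda_i \mathbf{v}_i),$$

or

$$\frac{1}{\lambda_i}\mathbf{v}_i = A^{-1}\mathbf{v}_i.$$

Therefore, for each $i = 1, 2, 3, \ldots, n$, $1/\lambda_i$ is an eigenvalue of $A^{-1}$, with
associated eigenvector $\mathbf{v}_i$. ∎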
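
Both parts of the theorem are easy to check numerically. The sketch below is only an illustration: it reuses the $3 \times 3$ matrix from Example 4.2 (eigenvalues $-12$, $-3$, and $3$), and the polynomial $p(x) = 1 + 2x + x^2$ and the variable names are hypothetical choices. `numpy.linalg.eigvals` provides the reference eigenvalues.

```python
import numpy as np

# Matrix from Example 4.2; its eigenvalues are -12, -3 and 3.
A = np.array([[-2.0, -2.0, 3.0],
              [-10.0, -1.0, 6.0],
              [10.0, -2.0, -9.0]])
lam = np.array([-12.0, -3.0, 3.0])


def p(t):
    """Hypothetical polynomial p(x) = 1 + 2x + x^2, used for illustration."""
    return 1.0 + 2.0 * t + t ** 2


# Part 1: B = p(A) should have eigenvalues p(lambda_i).
B = np.eye(3) + 2.0 * A + A @ A
eig_B = np.sort(np.linalg.eigvals(B).real)
expected_B = np.sort(p(lam))

# Part 2: the inverse of A should have eigenvalues 1 / lambda_i.
eig_Ainv = np.sort(np.linalg.eigvals(np.linalg.inv(A)).real)
expected_Ainv = np.sort(1.0 / lam)

# The eigenvectors are shared: v = [1, -1, 1]^T belongs to lambda = 3.
v = np.array([1.0, -1.0, 1.0])
same_vector = np.allclose(B @ v, p(3.0) * v)

print(eig_B, expected_B)
print(eig_Ainv, expected_Ainv)
print(same_vector)
```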
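The Method

Once again, let $A$ be an $n \times n$ matrix with eigenvalues $\lambda_1, \lambda_2, \lambda_3, \ldots, \lambda_n$ and
associated eigenvectors $\mathbf{v}_1, \mathbf{v}_2, \mathbf{v}_3, \ldots, \mathbf{v}_n$. Let $q$ be any constant for which $A - qI$
is nonsingular (this will hold true for any $q$ that is not an eigenvalue of $A$), and
consider the matrix $B = (A - qI)^{-1}$. As a consequence of the theorem we just
finished proving, the eigenvalues of $B$ are

$$\mu_1 = \frac{1}{\lambda_1 - q}, \quad \mu_2 = \frac{1}{\lambda_2 - q}, \quad \mu_3 = \frac{1}{\lambda_3 - q}, \quad \ldots, \quad \mu_n = \frac{1}{\lambda_n - q}$$

with associated eigenvectors $\mathbf{v}_1, \mathbf{v}_2, \mathbf{v}_3, \ldots, \mathbf{v}_n$.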

If we apply the power method to the matrix $B$, the eigenvalue estimates $\lambda^{(m)}$
will converge to the dominant eigenvalue, say $\mu_k = 1/(\lambda_k - q)$. Note, however, that $\mu_k$ will be
the dominant eigenvalue of $B$ if and only if $\lambda_k$ is the eigenvalue of $A$ that is closest
to the number $q$. Hence, if by some means we determine that $A$ has an eigenvalue
in the vicinity of $q$, we can obtain an approximation to that eigenvalue by applying
the power method to the matrix $B = (A - qI)^{-1}$. This procedure is known as the
inverse power method.

An implementation of the inverse power method can be obtained from code
for the power method with only a few modifications. First, an extra input value,
the number $q$, must be included in the parameter list. Second, the operation $\mathbf{y}^{(m)} =
A\mathbf{x}^{(m-1)}$ must be replaced by $\mathbf{y}^{(m)} = (A - qI)^{-1}\mathbf{x}^{(m-1)}$. Of course, in practice,
we will solve the linear system $(A - qI)\mathbf{y}^{(m)} = \mathbf{x}^{(m-1)}$ for $\mathbf{y}^{(m)}$. Since the matrix
$A - qI$ does not change during the iteration process, the factorization of $A - qI$ can
be computed once prior to the iteration loop and only the solve step (forward and
backward substitution) need be performed with each iteration. Third, remember
that the sequence $\{\lambda^{(m)}\}$ converges to $\mu_k = (\lambda_k - q)^{-1}$. To obtain an approximation
to $\lambda_k$, we must compute $(1/\lambda^{(m)}) + q$. The eigenvectors of $A$ and $(A - qI)^{-1}$ are
the same, so no manipulation of the sequence $\{\mathbf{x}^{(m)}\}$ is necessary.
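The three modifications just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not code from the text: the helper names `lu_factor`, `lu_solve` and `inverse_power` are hypothetical, the scaling uses the component of largest magnitude, and the stopping test compares successive eigenvector estimates in the infinity norm. The sample matrix, shift and starting vector are those of Example 4.4.

```python
def lu_factor(M):
    """Return (LU, perm): Doolittle factors of M, computed with partial pivoting."""
    n = len(M)
    lu = [list(map(float, row)) for row in M]
    perm = list(range(n))
    for j in range(n):
        p = max(range(j, n), key=lambda r: abs(lu[r][j]))
        if lu[p][j] == 0.0:
            raise ValueError("A - qI is singular: q is an eigenvalue of A")
        lu[j], lu[p] = lu[p], lu[j]
        perm[j], perm[p] = perm[p], perm[j]
        for i in range(j + 1, n):
            lu[i][j] /= lu[j][j]              # multiplier, stored in the L part
            for c in range(j + 1, n):
                lu[i][c] -= lu[i][j] * lu[j][c]
    return lu, perm


def lu_solve(lu, perm, b):
    """Solve with the stored factors: forward, then backward substitution."""
    n = len(lu)
    y = [b[perm[i]] for i in range(n)]
    for i in range(n):
        for c in range(i):
            y[i] -= lu[i][c] * y[c]
    for i in reversed(range(n)):
        for c in range(i + 1, n):
            y[i] -= lu[i][c] * y[c]
        y[i] /= lu[i][i]
    return y


def inverse_power(A, q, x0, tol=5e-6, max_iter=100):
    """Approximate the eigenvalue of A closest to q and its eigenvector."""
    n = len(A)
    shifted = [[A[i][j] - (q if i == j else 0.0) for j in range(n)]
               for i in range(n)]
    lu, perm = lu_factor(shifted)             # factor once, before the loop
    p = max(range(n), key=lambda i: abs(x0[i]))
    x = [v / x0[p] for v in x0]
    for m in range(1, max_iter + 1):
        y = lu_solve(lu, perm, x)             # y = (A - qI)^{-1} x
        mu = y[p]                             # estimate of 1/(lambda_k - q)
        p = max(range(n), key=lambda i: abs(y[i]))
        x_new = [v / y[p] for v in y]
        err = max(abs(a - b) for a, b in zip(x_new, x))
        x = x_new
        if err < tol:
            return 1.0 / mu + q, x, m         # undo the inversion and the shift
    raise RuntimeError("no convergence within max_iter iterations")


A_example = [[12, 1, 1, 0, 3],
             [-1, 3, 0, 1, 0],
             [1, 0, -6, 2, 1],
             [0, 2, 1, 9, 0],
             [1, 0, 1, 0, -2]]
lam, vec, iters = inverse_power(A_example, q=3.0, x0=[1.0] * 5)
print(iters, lam, vec)
```

Factoring outside the loop is the design choice emphasized in the text; each iteration then costs only two triangular solves.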

Note that $(A - qI)^{-1}$ is symmetric whenever $A$ is symmetric. The inverse
power method can therefore be implemented with both the general version of the
power method and its variation for symmetric matrices. What about the
convergence of the inverse power method sequences? As above, suppose that $\mu_k$ is
the dominant eigenvalue of $(A - qI)^{-1}$, and further suppose that $\mu_j$ is the second
largest eigenvalue in magnitude. The sequences $\{\mathbf{x}^{(m)}\}$ and $\{\lambda^{(m)}\}$ then converge linearly with
asymptotic error constant

\[ O\left(\left|\frac{\mu_j}{\mu_k}\right|\right) = O\left(\left|\frac{\lambda_k - q}{\lambda_j - q}\right|\right) \]
for general matrices and with asymptotic error constant
\[ O\left(\left|\frac{\mu_j}{\mu_k}\right|^2\right) = O\left(\left|\frac{\lambda_k - q}{\lambda_j - q}\right|^2\right) \]

for symmetric matrices. Hence, convergence of the inverse power method depends
not only on the separation of the eigenvalues, but also on the accuracy of the
estimate $q$.
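As a rough check of these ratios, consider the symmetric matrix of Example 4.5 with $q = 0$. Multiplying that matrix by the vectors $[1\ 1\ 1\ {-1}]^T$, $[0\ 1\ {-1}\ 0]^T$, $[1\ 0\ 0\ 1]^T$ and $[1\ {-1}\ {-1}\ {-1}]^T$ shows that its eigenvalues are 2, 4, 4 and 12 (a hand computation added here, not a statement from the example). Hence $\lambda_k = 2$ and the nearest competitor is $\lambda_j = 4$:

```latex
\[
\left|\frac{\lambda_k - q}{\lambda_j - q}\right| = \frac{2 - 0}{4 - 0} = \frac{1}{2},
\qquad
\left|\frac{\lambda_k - q}{\lambda_j - q}\right|^2 = \frac{1}{4}.
\]
```

The eigenvalue errors tabulated in that example (0.001951, 0.000488, 0.000122, 0.000031) do shrink by a factor of about 1/4 per iteration, while the errors in the individual eigenvector entries shrink by a factor of about 1/2.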

EXAMPLE 4.4 A Demonstration of the Inverse Power Method 


Consider the $5 \times 5$ matrix

\[
A = \begin{bmatrix}
12 & 1 & 1 & 0 & 3 \\
-1 & 3 & 0 & 1 & 0 \\
1 & 0 & -6 & 2 & 1 \\
0 & 2 & 1 & 9 & 0 \\
1 & 0 & 1 & 0 & -2
\end{bmatrix}.
\]

The Gerschgorin circles for $A$ are plotted in Figure 4.4. Each circle $C_i$ corresponds
to the $i$th row of the matrix. Note that circle $C_2 = \{z \in \mathbb{C} : |z - 3| \le 2\}$ is
disjoint from the other four circles and hence is guaranteed to contain one of the
five eigenvalues. Furthermore, since $A$ is a real matrix, the eigenvalue in $C_2$ must
be real and therefore must lie in the interval $[1, 5]$.

Figure 4.4  Gerschgorin circles for the matrix $A$ of Example 4.4 (horizontal axis: Re($z$)).

Unfortunately, as is clear from Figure 4.4, the eigenvalue in $C_2$ is not the
dominant eigenvalue of the matrix, so the power method will not locate it. However,
the inverse power method can. Let's take $q = 3$, since this is the center of the
Gerschgorin circle. With a starting vector of

\[ \mathbf{x}^{(0)} = [\,1\ \ 1\ \ 1\ \ 1\ \ 1\,]^T \]
and a convergence tolerance of $TOL = 5 \times 10^{-6}$, only five iterations are needed to
compute
\[ \lambda \approx 2.779638 \]
and
\[ \mathbf{v} \approx [\,-0.087621\ \ 1.000000\ \ -0.084234\ \ -0.307983\ \ -0.035955\,]^T. \]

Table 4.2 displays the output of each iteration.

m   x^(m)T                                                        3 + 1/λ^(m)
0   [  1.000000  1.000000   1.000000   1.000000   1.000000 ]
1   [ -0.130952  1.000000  -0.068452  -0.360119   0.005952 ]     4.750000
2   [ -0.087393  1.000000  -0.083210  -0.306325  -0.033860 ]     2.781069
3   [ -0.087658  1.000000  -0.084217  -0.308045  -0.035867 ]     2.779612
4   [ -0.087622  1.000000  -0.084233  -0.307981  -0.035952 ]     2.779641
5   [ -0.087621  1.000000  -0.084234  -0.307983  -0.035955 ]     2.779638

TABLE 4.2: Table for Example 4.4.


Eigenvalue Smallest in Magnitude 


As noted in the introduction to this section, there are many practical applications
that require approximating the eigenvalue of a matrix that is smallest in magnitude.
How do we proceed in this case? Recall that each eigenvalue of $A^{-1}$ is the
reciprocal of an eigenvalue of $A$. It follows that the dominant eigenvalue of the matrix
$A^{-1}$ corresponds to the eigenvalue of $A$ that is smallest in magnitude. Thus, to
approximate the eigenvalue that is smallest in magnitude, we can apply the inverse
power method with $q = 0$.


EXAMPLE 4.5 Approximating the Eigenvalue Smallest in Magnitude 


In the previous section, we used the power method to approximate the dominant
eigenvalue, $\lambda = 12$, for the matrix

\[
A = \begin{bmatrix}
5.5 & -2.5 & -2.5 & -1.5 \\
-2.5 & 5.5 & 1.5 & 2.5 \\
-2.5 & 1.5 & 5.5 & 2.5 \\
-1.5 & 2.5 & 2.5 & 5.5
\end{bmatrix}.
\]

Here, we will use the inverse power method, with $q = 0$, to approximate the
eigenvalue that is smallest in magnitude. The table below lists the output from
the 15 iterations needed to obtain convergence of the eigenvector with a
convergence tolerance of $TOL = 5 \times 10^{-5}$, starting from the initial vector
$\mathbf{x}^{(0)} = [\,0.5\ \ 0.5\ \ 0.5\ \ 0.5\,]^T$. The final estimates are
\[ \lambda_{\min} \approx 2.000000 \]
and
\[ \mathbf{v} \approx [\,0.500031\ \ 0.500000\ \ 0.500000\ \ -0.499969\,]^T. \]

For comparison, the exact eigenvalue is $\lambda = 2$, and the exact eigenvector with unit
length in the $\ell_2$ norm is $\mathbf{v} = [\,1/2\ \ 1/2\ \ 1/2\ \ -1/2\,]^T$.

m    x^(m)T                                           1/λ^(m)
0    [ 0.500000  0.500000  0.500000   0.500000 ]
1    [ 0.741620  0.471940  0.471940   0.067420 ]     3.692308
2    [ 0.693774  0.484333  0.484333  -0.222531 ]     2.435424
3    [ 0.613172  0.494640  0.494640  -0.366991 ]     2.118843
4    [ 0.559931  0.498442  0.498442  -0.435417 ]     2.030804
5    [ 0.530668  0.499577  0.499577  -0.468229 ]     2.007783
6    [ 0.515488  0.499889  0.499889  -0.484246 ]     2.001951
7    [ 0.507780  0.499971  0.499971  -0.492156 ]     2.000488
8    [ 0.503898  0.499993  0.499993  -0.496086 ]     2.000122
9    [ 0.501951  0.499998  0.499998  -0.498045 ]     2.000031
10   [ 0.500976  0.500000  0.500000  -0.499023 ]     2.000008
11   [ 0.500488  0.500000  0.500000  -0.499512 ]     2.000002
12   [ 0.500244  0.500000  0.500000  -0.499756 ]     2.000000
13   [ 0.500122  0.500000  0.500000  -0.499878 ]     2.000000
14   [ 0.500061  0.500000  0.500000  -0.499939 ]     2.000000
15   [ 0.500031  0.500000  0.500000  -0.499969 ]     2.000000
Application Problem 1: Steady-State Distribution of the British Workforce


In the Chapter 1 Overview (see page 4), we discussed a model of the British
workforce. Recall that the workforce population was divided into the seven occupational
classes

1. higher-grade professionals/administrators;
2. lower-grade professionals/administrators and higher-grade technicians;
3. routine nonmanual employees;
4. small proprietors;
5. lower-grade technicians and supervisors of manual laborers;
6. skilled manual laborers; and
7. semiskilled and unskilled manual laborers,

and the transition matrix between the classes was estimated to be


\[
P = \begin{bmatrix}
0.452 & 0.291 & 0.184 & 0.126 & 0.142 & 0.078 & 0.065 \\
0.189 & 0.231 & 0.157 & 0.114 & 0.136 & 0.088 & 0.078 \\
0.115 & 0.119 & 0.128 & 0.080 & 0.101 & 0.083 & 0.082 \\
0.077 & 0.070 & 0.078 & 0.244 & 0.077 & 0.065 & 0.066 \\
0.048 & 0.096 & 0.128 & 0.087 & 0.157 & 0.123 & 0.125 \\
0.054 & 0.106 & 0.156 & 0.144 & 0.212 & 0.304 & 0.235 \\
0.065 & 0.087 & 0.169 & 0.205 & 0.175 & 0.259 & 0.349
\end{bmatrix}.
\]

Each entry, $p_{ij}$, represents the proportion of children born to parents in occupational
class $j$ who became members of occupational class $i$.

Assuming that this transition matrix remains valid over time, the steady
state distribution of the workforce among the indicated classes, denoted by $\pi$, is
the eigenvector of $P$ that is associated with the eigenvalue 1 and whose elements
sum to one. Applying the inverse power method to the matrix $P$ with $q = 1.01$, an
initial vector of $\mathbf{x}^{(0)} = [\,1\ \ 1\ \ 1\ \ 1\ \ 1\ \ 1\ \ 1\,]^T$ and a convergence tolerance of
$TOL = 5 \times 10^{-6}$, four iterations are required to produce the eigenvector estimate

\[ [\,1.0000\ \ 0.6945\ \ 0.4952\ \ 0.4253\ \ 0.5187\ \ 0.8561\ \ 0.9351\,]^T. \]

This is not quite the vector that we want, because the sum of the entries is not
equal to one. Rescaling this vector, we find the steady state workforce distribution
vector to be

\[ \pi = [\,0.2030\ \ 0.1410\ \ 0.1005\ \ 0.0864\ \ 0.1053\ \ 0.1738\ \ 0.1899\,]^T. \]

Hence, if the state transition matrix given above were to remain valid over time, we
would expect roughly 20.30% of the population to be employed as higher-grade
professionals/administrators, 14.10% to be employed as lower-grade
professionals/administrators and higher-grade technicians, 10.05% to be employed as routine
nonmanual employees, and so on.
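The steady-state calculation can be reproduced with a short script. The sketch below is an illustration, not the author's program: `solve` and `steady_state` are hypothetical helper names, and for a matrix this small the linear system is simply re-solved by Gaussian elimination in every iteration instead of reusing a stored factorization.

```python
def solve(M, b):
    """Gaussian elimination with partial pivoting on a copy of [M | b]."""
    n = len(M)
    aug = [list(map(float, row)) + [float(b[i])] for i, row in enumerate(M)]
    for j in range(n):
        p = max(range(j, n), key=lambda r: abs(aug[r][j]))
        aug[j], aug[p] = aug[p], aug[j]
        for i in range(j + 1, n):
            f = aug[i][j] / aug[j][j]
            for c in range(j, n + 1):
                aug[i][c] -= f * aug[j][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = aug[i][n] - sum(aug[i][c] * x[c] for c in range(i + 1, n))
        x[i] = s / aug[i][i]
    return x


def steady_state(P, q=1.01, tol=5e-6, max_iter=50):
    """Inverse power method with shift q, then rescale so the entries sum to one."""
    n = len(P)
    shifted = [[P[i][j] - (q if i == j else 0.0) for j in range(n)]
               for i in range(n)]
    x = [1.0] * n
    for _ in range(max_iter):
        y = solve(shifted, x)
        p = max(range(n), key=lambda i: abs(y[i]))
        x_new = [v / y[p] for v in y]          # largest entry scaled to 1
        converged = max(abs(a - b) for a, b in zip(x_new, x)) < tol
        x = x_new
        if converged:
            break
    total = sum(x)
    return [v / total for v in x]


P = [[0.452, 0.291, 0.184, 0.126, 0.142, 0.078, 0.065],
     [0.189, 0.231, 0.157, 0.114, 0.136, 0.088, 0.078],
     [0.115, 0.119, 0.128, 0.080, 0.101, 0.083, 0.082],
     [0.077, 0.070, 0.078, 0.244, 0.077, 0.065, 0.066],
     [0.048, 0.096, 0.128, 0.087, 0.157, 0.123, 0.125],
     [0.054, 0.106, 0.156, 0.144, 0.212, 0.304, 0.235],
     [0.065, 0.087, 0.169, 0.205, 0.175, 0.259, 0.349]]
pi = steady_state(P)
print([round(v, 4) for v in pi])
```

The shift $q = 1.01$ is deliberately close to, but not equal to, the known eigenvalue 1, so that $P - qI$ stays nonsingular while the convergence ratio remains small.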


Application Problem 2: Critical Buckling Load 


Consider a long, slender rod of length $L$, as shown in the diagram below. The
rod is clamped at one end and hinged at the other. At the hinged end, the rod
is compressed axially by a force $P$. As long as the magnitude of $P$ remains below
a certain critical load, $P_{cr}$, the rod will remain straight. However, for $P > P_{cr}$,
the rod will deflect from the straight configuration, or buckle. We would like to
determine both $P_{cr}$ and the shape of the buckled configuration.

[Diagram: rod clamped at one end, hinged at the other, compressed axially by the force $P$.]


Let $X$ denote distance along the length of the rod, with $X = 0$ corresponding
to the clamped end. Further, let $w(X)$ denote the deflection of the rod from its
straight configuration. The differential equation for $w$ is

\[ \frac{d^2}{dX^2}\left(EI\,\frac{d^2 w}{dX^2}\right) + P\,\frac{d^2 w}{dX^2} = 0, \tag{1} \]
where $E$ is Young's modulus and $I$ is the moment of inertia of the rod's cross
section. See Timoshenko and Gere [1] for a derivation of this equation. At the
clamped end, the rod can neither deflect nor rotate, so

\[ w(0) = \frac{dw}{dX}(0) = 0; \tag{2} \]
at the hinged end, the rod cannot deflect and no bending moment can be imparted,
so
\[ w(L) = \frac{d^2 w}{dX^2}(L) = 0. \tag{3} \]

Taking $EI$ to be constant along the rod and introducing the nondimensional
independent variable $x = X/L$, equations (1), (2), and (3) become

\[ w'''' + \lambda w'' = 0, \qquad w(0) = w'(0) = w(1) = w''(1) = 0, \tag{4} \]

where primes denote differentiation with respect to $x$ and $\lambda = PL^2/(EI)$.
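To approximate the solution of (4), we will start by dividing the interval $[0, 1]$
into three equal sized pieces. Then we will assume that

\[ w(x) = \sum_{i=1}^{5} c_i \phi_i(x), \tag{5} \]

where the $c$'s are constants to be determined and the $\phi$'s are known basis functions.
For this problem, we will use the functions

\[
\phi_1(x) = \begin{cases}
27x^2(1 - 2x), & 0 \le x \le \tfrac{1}{3} \\
(6x - 1)(3x - 2)^2, & \tfrac{1}{3} \le x \le \tfrac{2}{3} \\
0, & \text{elsewhere}
\end{cases}
\]
\[
\phi_2(x) = \begin{cases}
3x^2(3x - 1), & 0 \le x \le \tfrac{1}{3} \\
(x - \tfrac{1}{3})(2 - 3x)^2, & \tfrac{1}{3} \le x \le \tfrac{2}{3} \\
0, & \text{elsewhere}
\end{cases}
\]
\[ \phi_3(x) = \phi_1\left(x - \tfrac{1}{3}\right), \]
\[ \phi_4(x) = \phi_2\left(x - \tfrac{1}{3}\right), \]

and

\[
\phi_5(x) = \begin{cases}
(x - 1)(3x - 2)^2, & \tfrac{2}{3} \le x \le 1 \\
0, & \text{elsewhere.}
\end{cases}
\]

Graphs of these functions are shown in Figure 4.5. The function pieces that are
used to construct these basis functions are called Hermite cubics. Hermite cubics
will be studied in more detail in Section 5.7.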

Next, multiply the differential equation in (4) by $\phi_j(x)$ (for $j = 1, 2, 3, 4,$
and 5) and integrate the resulting expression from $x = 0$ to $x = 1$. This yields

\[ 0 = \int_0^1 \left(w''''(x) + \lambda w''(x)\right)\phi_j(x)\,dx = \int_0^1 w''''(x)\phi_j(x)\,dx + \lambda\int_0^1 w''(x)\phi_j(x)\,dx. \]

Figure 4.5 Graphs of basis functions used in Application Problem 2.

If we now integrate the integral containing $w''''$ by parts twice and the integral
containing $w''$ by parts once, we obtain

\[ 0 = \int_0^1 \left(w''(x)\phi_j''(x) - \lambda w'(x)\phi_j'(x)\right)dx + \Big[w'''(x)\phi_j(x) + \lambda w'(x)\phi_j(x) - w''(x)\phi_j'(x)\Big]_0^1. \tag{6} \]

Since $\phi_j(0) = \phi_j(1) = \phi_j'(0) = 0$ for each $j$ and $w''(1) = 0$, the second term on the
right-hand side of equation (6) is identically zero. Consequently,

\[ \int_0^1 \left(w''(x)\phi_j''(x) - \lambda w'(x)\phi_j'(x)\right)dx = 0. \tag{7} \]

Equation (7) is called the variational, or weak, formulation of (4).
Substituting (5) into (7) and rearranging terms leads to the system of algebraic
equations
\[ A\mathbf{c} = \lambda B\mathbf{c}, \tag{8} \]
where $\mathbf{c} = [\,c_1\ \ c_2\ \ c_3\ \ c_4\ \ c_5\,]^T$,

\[
A = \left[\int_0^1 \phi_i''(x)\phi_j''(x)\,dx\right] = \begin{bmatrix}
648 & 0 & -324 & 54 & 0 \\
0 & 24 & -54 & 6 & 0 \\
-324 & -54 & 648 & 0 & 54 \\
54 & 6 & 0 & 24 & 6 \\
0 & 0 & 54 & 6 & 12
\end{bmatrix}
\]
and
\[
B = \left[\int_0^1 \phi_i'(x)\phi_j'(x)\,dx\right] = \begin{bmatrix}
36/5 & 0 & -18/5 & 1/10 & 0 \\
0 & 4/45 & -1/10 & -1/90 & 0 \\
-18/5 & -1/10 & 36/5 & 0 & 1/10 \\
1/10 & -1/90 & 0 & 4/45 & -1/90 \\
0 & 0 & 1/10 & -1/90 & 2/45
\end{bmatrix}.
\]

We are interested in nonzero solutions for the vector $\mathbf{c}$, as these correspond to
deflected configurations for the rod. In particular, we want to determine the smallest
value of $\lambda$ (as this corresponds to the smallest axial force) for which (8) has nonzero
solutions. Thus, we have transformed what was a boundary value problem into an
eigenvalue problem. Because matrices appear on both sides of (8), this is called a
generalized eigenvalue problem.

To proceed with our analysis, note that $B$ is symmetric positive definite. If we
let $LL^T$ denote the Cholesky factorization of $B$ and then define $C = L^{-1}A(L^T)^{-1}$
and $\mathbf{y} = L^T\mathbf{c}$, the generalized eigenvalue problem $A\mathbf{c} = \lambda B\mathbf{c}$ becomes the standard
eigenvalue problem $C\mathbf{y} = \lambda\mathbf{y}$. In this instance,

\[
C = \begin{bmatrix}
90.0000 & 0.0000 & 0.0000 & 57.2385 & 11.4146 \\
0.0000 & 270.0000 & -39.3835 & 105.2214 & 29.4498 \\
0.0000 & -39.3835 & 74.6809 & 40.9283 & 108.1915 \\
57.2385 & 105.2214 & 40.9283 & 280.6100 & 141.9272 \\
11.4146 & 29.4498 & 108.1915 & 141.9272 & 292.2044
\end{bmatrix}.
\]
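The reduction to standard form can be written out in a few lines. The steps below are a worked restatement of the substitution described above, added for illustration; the final check of the $(1,1)$ entry uses only the fact that $L$ is lower triangular with $L_{11}^2 = B_{11}$.

```latex
\begin{aligned}
A\mathbf{c} &= \lambda B\mathbf{c} = \lambda L L^T \mathbf{c} \\
L^{-1}A\mathbf{c} &= \lambda L^T \mathbf{c} \\
L^{-1}A(L^T)^{-1}\,(L^T\mathbf{c}) &= \lambda\,(L^T\mathbf{c}) \\
C\mathbf{y} &= \lambda\mathbf{y},
\qquad C = L^{-1}A(L^T)^{-1},\quad \mathbf{y} = L^T\mathbf{c}. \\[4pt]
C_{11} &= \frac{A_{11}}{L_{11}^2} = \frac{648}{36/5} = 90.
\end{aligned}
```

Because $C^T = L^{-1}A^T(L^T)^{-1} = C$ whenever $A$ is symmetric, the transformed problem can be handed to the symmetric version of the inverse power method.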


Applying the inverse power method with $q = 0$, a randomly selected initial vector
and a convergence tolerance of $5 \times 10^{-6}$ produces

\[ \lambda \approx 20.314651 \quad\Rightarrow\quad P_{cr} \approx 20.314651\,\frac{EI}{L^2} \]

and
\[ \mathbf{y} \approx [\,0.101824\ \ 0.206840\ \ 0.903036\ \ -0.052444\ \ -0.358643\,]^T. \]

Solving the equation $\mathbf{y} = L^T\mathbf{c}$ for $\mathbf{c}$ and normalizing the result to unit length in
the $\ell_2$-norm yields

\[ \mathbf{c} \approx [\,0.117527\ \ 0.511028\ \ 0.195196\ \ -0.192401\ \ -0.806175\,]^T. \]

Finally, using the elements of $\mathbf{c}$ in (5) generates the deflected configuration shown
in Figure 4.6.

The technique we used to convert the boundary value problem in (4) into the 
eigenvalue problem in (8) is known as finite element analysis. The reader who is 
interested in learning more about finite element analysis should consult one of the 
standard texts, such as Johnson [2], Logan [3], or Wait and Mitchell [4]. For a more 
theoretical treatment, see Brenner and Scott [5] or Oden and Reddy [6]. 
Figure 4.6 Deflection of clamped-hinged rod when loaded with its
critical buckling load $P_{cr}$.
EXERCISES

In Exercises 1-4, approximate the eigenvalue of the given matrix that is nearest to the
indicated value, and determine its associated eigenvector. In each case use a convergence
tolerance of $5 \times 10^{-5}$.


1 4 $§ 
1 A=/]4 -3 O q=l 


ba Sy “27 
2. \[ A = \begin{bmatrix} 1 & -0.4 & -0.6 \\ -0.4 & 1 & 0.4 \\ -0.6 & 0.4 & 1 \end{bmatrix}, \qquad q = 0.7 \]
Pea de 2s <2 
af heel oh, 1 
3. A= 1 0 2 0 q=2 
Ll 1 -3 4 
1667 -7 
4. A=|-1 2 1 q=3 
11 7 -5 


? 


In Exercises 5-8, approximate the smallest eigenvalue and its associated eigenvector
for the given matrix. In each case use a convergence tolerance of $5 \times 10^{-5}$.


42 -2 2 
alicia: dike 
eae ee ee 
a a 
1 2 3 
GES eke A 
3 -1 0 
5 28 
7 AS A Bd 
3-6 7 
Shs ed uO, tell 
en! Gs a 
iad ee ee ae 
a eee 


In Exercises 9-12, sketch the Gerschgorin circles for the given matrix, and obtain an 
approximation for every eigenvalue which is contained in an isolated circle. 


7 2 1 0 1 
1 -8 1 2 —-1 
9 A= 0 -1 0 -1i 0 
2 0 9 9 1 
-l1 0 -1 0 2 
14-8 2 1 
2 -12 1 =O 
10. A= ce eee Oe 
0 --1 2 3 
820 1 
202 0 
Wl. A= 021] 
i041 -8 


13. 


14. 


15. 
20 3 -1 #21 
3. T -2 2 
om -l -2 -5 1 


Each of the following matrices has two distinct eigenvalues that are equidistant
from the indicated value of $q$. Apply the inverse power method with the indicated
value of $q$ and several randomly selected initial vectors. Limit calculations to at
most 20 iterations and comment on the behavior of the eigenvalue and eigenvector
sequences.

5.5 -2.5 -2.5 —-1.5 

—2.5 5.5 1.5 2.5 


(a) A=| os 15 55 28 G= 6 
hr, Op i 55 
eo 20 

(BAS | Or. 0) 4 | a5 
4-17 8 


Each of the following matrices has a unique eigenvalue of multiplicity greater
than one nearest to the indicated value of $q$, but that eigenvalue does not possess
a complete set of linearly independent eigenvectors. Apply the inverse power
method with the indicated value of $q$ and several randomly selected initial vectors.
Limit calculations to at most 20 iterations and comment on the behavior of the
eigenvalue and eigenvector sequences.


00 0 0 -189 


10 0 0 27 
(a) A=| 0 1 0 0 126 q=4 
001 0 —74 
000 1 = #15 
2.75 0.25 —8.75 3,25 
(b) A= 0.25 2.75 -3.25 —3.75 et 


—3.25 -3.75 2.25 0.75 
-3.75 -3.25 0.75 2.25 


A common problem which arises in computer graphics is the determination of the 
axis and angle of rotation for a composite rotation matrix, R (see S. Alessandrini, 
“A Motivational Example for the Numerical Solution of the Algebraic Eigenvalue 
Problem,” SIAM Review, 40(4), 935-940, 1998). Here, we will focus on deter- 
mining the axis of rotation. By definition, the axis of rotation, a, is unchanged 
by R; that is, Ra = a. Hence, a is an eigenvector of the rotation matrix associ- 
ated with the eigenvalue 1. It is conventional to report the axis of rotation with 
unit length in the Euclidean norm. Determine the axis of rotation for each of 
the following rotation matrices. 
0.686237 —0.714977 —0.133738 

(a) R=} 0.652225 0.686237 —0.321996 | 
0.321996 0.133738 0.937247 


0.230560 —0.494104 0.838273 
(b) R= | 0.041493 -0.855708 —0.515793 
0.972173 0.153704 -—0.176790 
16. 


17. 


18. 


19. 
(c) \[ R = \begin{bmatrix} 0.753220 & 0.041955 & 0.656429 \\ 0.550316 & 0.506441 & -0.663830 \\ -0.360293 & 0.861253 & 0.358373 \end{bmatrix} \]


A sporting goods store sells 20-gallon aquariums. The store starts each week
with at most four aquariums in stock. At the end of each week, if there are no
aquariums left in stock, an order for four aquariums is placed; otherwise no order
is placed.

The steady-state probability distribution for the number of aquariums that
are in stock at the beginning of any given week (1, 2, 3, or 4) is given by an
eigenvector of the state transition matrix

\[
P = \begin{bmatrix}
e^{-\mu} & 0 & 0 & 1 - e^{-\mu} \\
\mu e^{-\mu} & e^{-\mu} & 0 & 1 - (1 + \mu)e^{-\mu} \\
\mu^2 e^{-\mu}/2 & \mu e^{-\mu} & e^{-\mu} & 1 - (1 + \mu + \mu^2/2)e^{-\mu} \\
\mu^3 e^{-\mu}/6 & \mu^2 e^{-\mu}/2 & \mu e^{-\mu} & 1 - (\mu + \mu^2/2 + \mu^3/6)e^{-\mu}
\end{bmatrix}
\]

associated with the eigenvalue $\lambda = 1$. To be a valid probability distribution, the
eigenvector must be normalized so that the sum of its components is equal to
one. Here, $\mu$ represents the average number of aquarium sales per week.
Determine the steady-state probability distribution for the number of aquariums
in stock at the start of a week for $\mu = 1.0$ through $\mu = 2.0$ in increments of
0.25.


In a study of the distribution of red and gray squirrels among habitats in Scotland
(Usher, Crawford, and Banwell, “An American Invasion of Great Britain: The
Case of the Native and Alien Squirrel Species,” Conservation Biology, 6, 108-
115, 1992), the following transition matrix was reported:

\[
\begin{bmatrix}
0.874 & 0.095 & 0.077 & 0.059 \\
0.015 & 0.722 & 0.007 & 0.193 \\
0.109 & 0.095 & 0.906 & 0.143 \\
0.002 & 0.087 & 0.009 & 0.605
\end{bmatrix}
\]

In constructing this matrix, each habitat was classified into one of four categories:
only red squirrels present, only gray squirrels present, neither species present, and
both species present. Determine the steady-state distribution associated with the
given transition matrix.


The paper cited in Exercise 17 also contains the transition matrix

\[
\begin{bmatrix}
0.8797 & 0.0212 & 0.0981 & 0.0010 \\
0.0382 & 0.8002 & 0.0273 & 0.1343 \\
0.0525 & 0.0041 & 0.8802 & 0.0633 \\
0.0008 & 0.0143 & 0.0527 & 0.9322
\end{bmatrix},
\]

which considers habitats in all of Great Britain, not just Scotland. Determine
the steady-state distribution associated with the given transition matrix.


Usher (“Studies on a Wood-Feeding Termite Community in Ghana, West Africa,”
Biotropica, 7, 217-233, 1975) investigated changes in the species of termite
infesting baitwood blocks over a 48-week period. Every four weeks, the blocks were
inspected and classified according to the following schema: species Ancistrotermes
present; species Macrotermes present; species Pseudacanthotermes present;
species Microtermes present; known species other than previous four present;
unknown species present; and no termite activity. The transition matrix among
these classifications was found to be

\[
\begin{bmatrix}
0.471 & 0.336 & 0.326 & 0.287 & 0.378 & 0.400 & 0.375 \\
0.029 & 0.187 & 0.054 & 0.045 & 0.010 & 0.041 & 0.018 \\
0.057 & 0.075 & 0.203 & 0.071 & 0.057 & 0.021 & 0.064 \\
0.068 & 0.075 & 0.084 & 0.310 & 0.057 & 0.041 & 0.048 \\
0.031 & 0.030 & 0.015 & 0.047 & 0.130 & 0.028 & 0.030 \\
0.018 & 0.007 & 0.007 & 0.017 & 0.093 & 0.055 & 0.035 \\
0.326 & 0.291 & 0.310 & 0.223 & 0.275 & 0.414 & 0.430
\end{bmatrix}
\]

Determine the steady-state distribution associated with the given transition matrix.

20. Let $A$ be an $n \times n$ matrix. Suppose the eigenvalues of $A$ satisfy the relations
\[ \lambda_1 > \lambda_2 \ge \lambda_3 \ge \cdots \ge \lambda_{n-1} > \lambda_n. \]

Note the absence of absolute values. The eigenvalue $\lambda_1$ is the largest eigenvalue
of $A$, but not necessarily the dominant eigenvalue. Similarly, $\lambda_n$ is the smallest
eigenvalue of $A$, but not necessarily the eigenvalue of smallest magnitude. Devise
an algorithm to determine $\lambda_1$ and $\lambda_n$ (and corresponding eigenvectors) assuming
$|\lambda_1| \ne |\lambda_n|$.

21. Using the algorithm from Exercise 20, determine the largest ($\lambda_1$) and the smallest
($\lambda_n$) eigenvalues of the following matrices:
(a) the matrix from Exercise 9.  (b) the matrix from Exercise 10.
(c) the matrix from Exercise 11. (d) the matrix from Exercise 12.


22. Hoffman [7] established the following lower bound for the chromatic number of
a graph:
\[ \chi \ge 1 + \frac{\lambda_1}{|\lambda_n|}, \]

where $\lambda_1$ is the largest and $\lambda_n$ the smallest eigenvalue of the adjacency matrix
associated with the graph. Use this formula to obtain a lower bound for the
chromatic number of the graphs in Figures 4.2 and 4.3.


Exercises 23-28 deal with the generalized eigenvalue problem
\[ A\mathbf{x} = \lambda B\mathbf{x}, \]

where $A$ and $B$ are $n \times n$ matrices.


23. Suppose that $B$ is symmetric positive definite and has the Cholesky factorization
$LL^T$. Show that the generalized eigenvalue problem $A\mathbf{x} = \lambda B\mathbf{x}$ is equivalent
to the standard eigenvalue problem $C\mathbf{y} = \lambda\mathbf{y}$, where $C = L^{-1}A(L^T)^{-1}$ and
$\mathbf{y} = L^T\mathbf{x}$.




24. In the “Critical Buckling Load” application problem, the matrix $A$ was also
symmetric positive definite. Let $LL^T$ denote the Cholesky factorization of $A$
and then define $C = L^{-1}B(L^T)^{-1}$, $\mathbf{y} = L^T\mathbf{c}$, and $\omega = 1/\lambda$.

(a) Show that the generalized eigenvalue problem $A\mathbf{c} = \lambda B\mathbf{c}$ is equivalent to
the standard eigenvalue problem $C\mathbf{y} = \omega\mathbf{y}$.

(b) Determine the dominant eigenvalue and associated eigenvector for the
problem $C\mathbf{y} = \omega\mathbf{y}$. What are the corresponding values of $\lambda$ and $\mathbf{c}$?

(c) How do the results of part (b) compare to those presented in the text?

25. Suppose that $B$ is nonsingular. Construct an algorithm to approximate the
dominant eigenvalue of $A\mathbf{x} = \lambda B\mathbf{x}$ and its associated eigenvector.

26. Suppose that $A$ is nonsingular. Construct an algorithm to approximate the
eigenvalue of smallest magnitude of $A\mathbf{x} = \lambda B\mathbf{x}$ and its associated eigenvector.


27. For the given matrices $A$ and $B$, use the algorithm of Exercise 25 to determine
the dominant eigenvalue of $A\mathbf{x} = \lambda B\mathbf{x}$ and its associated eigenvector.


16 -8 0 0 16 2 0 0 
roll eR AR EB I ee ee 
(aye), See RS a Be a 72! ie. 2 
0 1 -8 7 a a ae 

401 0 1 0 

(6) AE li ee cg a-|c 0 1 

25 a 4 -17 8 


28. For the matrices $A$ and $B$ given in Exercise 27, use the algorithm of Exercise
26 to determine the eigenvalue of smallest magnitude of $A\mathbf{x} = \lambda B\mathbf{x}$ and its
associated eigenvector.


4.3. DEFLATION 


In the previous two sections, we have developed procedures for approximating the 
dominant eigenvalue of a matrix, the eigenvalue of a matrix that is smallest in 
magnitude and the eigenvalue of a matrix closest to some specified value. What if 
we need to approximate several of the largest eigenvalues or several of the smallest 
eigenvalues? For example, in a principal component analysis, we may want to 
determine more than the first principal component. 

One approach to handling problems of this type would be to compute the 
entire spectrum of the matrix—approximate every eigenvalue. For a small matrix, 
this might not be a bad approach. For a large matrix, however, computing the entire 
spectrum when only a small portion of the spectrum is needed would be extremely 
wasteful of computational effort. When only a small portion of the spectrum is
desired, it is much more efficient to employ a deflation strategy.

We first encountered the technique of deflation in Section 2-8 while investigating
the polynomial rootfinding problem. Recall that the objective of deflation is
to “remove” an already determined solution from the problem. Within the context
of polynomial rootfinding, we removed each root as it was computed by dividing
out the corresponding monomial. For the matrix eigenvalue problem, “removal”
is conventionally taken to mean modifying the matrix so as to shift the previously
determined eigenvalue to zero, while leaving the remainder of the spectrum
unchanged.

In this section we will consider two deflation techniques: Wielandt deflation, 
which can be applied to any matrix, and Hotelling deflation, which is specifically 
designed for symmetric matrices. An additional technique will be considered in the 
exercises. 


Eigenvalues and Eigenvectors of a Matrix and Its Transpose 


To establish the key theorem behind Wielandt deflation, we must first establish the
relationship among the eigenvalues and eigenvectors of a matrix $A$ and its transpose.
Recall that the eigenvalues of $A^T$ satisfy the equation $\det(A^T - \lambda I) = 0$. Since

\[ \det(A^T - \lambda I) = \det\left[(A - \lambda I)^T\right] = \det(A - \lambda I), \]

it follows that the eigenvalues of $A^T$ are the same as those of $A$. The eigenvectors
are generally not the same, but there is an important relationship among the
eigenvectors associated with different eigenvalues.

Let $\mathbf{v}_i$ be an eigenvector for the matrix $A$ associated with the eigenvalue $\lambda_i$,
and let $\mathbf{w}_j$ be an eigenvector for the matrix $A^T$ associated with the eigenvalue $\lambda_j$,
where $\lambda_i \ne \lambda_j$. Taking the transpose of the eigenvalue equation $A\mathbf{v}_i = \lambda_i\mathbf{v}_i$ and
postmultiplying the result by $\mathbf{w}_j$ yields $\mathbf{v}_i^T A^T \mathbf{w}_j = \lambda_i \mathbf{v}_i^T \mathbf{w}_j$. Using the eigenvalue
equation $A^T\mathbf{w}_j = \lambda_j\mathbf{w}_j$ then leads to $\lambda_j \mathbf{v}_i^T\mathbf{w}_j = \lambda_i \mathbf{v}_i^T\mathbf{w}_j$ or $(\lambda_j - \lambda_i)\mathbf{v}_i^T\mathbf{w}_j = 0$.
Since we have assumed $\lambda_i \ne \lambda_j$, it follows that $\mathbf{v}_i^T\mathbf{w}_j = 0$. In other words, eigenvectors
from $A$ and $A^T$ that are associated with different eigenvalues are orthogonal
with respect to the standard inner product on $\mathbb{R}^n$.


An Important Matrix Transformation 


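Let $A$ be an $n \times n$ matrix with eigenvalues $\lambda_1, \lambda_2, \lambda_3, \ldots, \lambda_n$ and associated
eigenvectors $\mathbf{v}_1, \mathbf{v}_2, \mathbf{v}_3, \ldots, \mathbf{v}_n$. Suppose that by some means (e.g., the power method
or the inverse power method) we have obtained approximations for $\lambda_1$ and $\mathbf{v}_1$, and
we now wish to deflate the spectrum of the matrix $A$.
Consider the matrix
\[ B = A - \lambda_1 \mathbf{v}_1 \mathbf{x}^T, \tag{1} \]

where $\mathbf{x}$ is an arbitrary $n$-vector. If we postmultiply equation (1) by the vector $\mathbf{v}_1$,
we find
\[
\begin{aligned}
B\mathbf{v}_1 &= A\mathbf{v}_1 - \lambda_1 \mathbf{v}_1 \mathbf{x}^T \mathbf{v}_1 \\
&= \lambda_1 \mathbf{v}_1 - \lambda_1 (\mathbf{x}^T\mathbf{v}_1)\mathbf{v}_1 \\
&= \lambda_1 \mathbf{v}_1 (1 - \mathbf{x}^T\mathbf{v}_1).
\end{aligned}
\]

Hence, provided $\mathbf{x}$ is chosen so that $\mathbf{x}^T\mathbf{v}_1 = 1$, zero is an eigenvalue of the matrix $B$
with associated eigenvector $\mathbf{v}_1$.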
What about the other eigenvalues of $A$? Have they been changed by the
transformation performed in equation (1)? Taking the transpose of equation (1) and
postmultiplying the result by $\mathbf{w}_i$, where $i = 2, 3, 4, \ldots, n$ and $\mathbf{w}_i$ is an eigenvector
of $A^T$ associated with the eigenvalue $\lambda_i$, gives

\[
\begin{aligned}
B^T\mathbf{w}_i &= A^T\mathbf{w}_i - \lambda_1 \mathbf{x}\mathbf{v}_1^T\mathbf{w}_i \\
&= \lambda_i\mathbf{w}_i - \mathbf{0} \\
&= \lambda_i\mathbf{w}_i.
\end{aligned}
\]

Note that in going from the first line to the second, we have used the orthogonality
of the eigenvectors of $A$ and $A^T$ that are associated with different eigenvalues. This
establishes that $\lambda_2, \lambda_3, \lambda_4, \ldots, \lambda_n$ are eigenvalues of $B^T$, which implies that they
are also eigenvalues of $B$.

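The last issue to address is the eigenvectors of the matrix $B$. For $i = 2, 3, 4, \ldots, n$, let $\mathbf{u}_i$ denote
an eigenvector of $B$ associated with the eigenvalue $\lambda_i$. We have already established
that $\mathbf{u}_1 = \mathbf{v}_1$, now associated with the eigenvalue 0. Given the construction of the matrix $B$, we can assume
that for $i = 2, 3, 4, \ldots, n$,

\[ \mathbf{v}_i = \alpha\mathbf{u}_i + \beta\mathbf{v}_1, \tag{2} \]
where $\alpha$ and $\beta$ are constants whose value is to be determined. If we postmultiply
equation (1) by $\mathbf{v}_i$,

\[
\begin{aligned}
B\mathbf{v}_i &= A\mathbf{v}_i - \lambda_1\mathbf{v}_1\mathbf{x}^T\mathbf{v}_i \\
&= \lambda_i\mathbf{v}_i - \lambda_1(\mathbf{x}^T\mathbf{v}_i)\mathbf{v}_1,
\end{aligned}
\]

and then substitute for $\mathbf{v}_i$ from equation (2), we find
\[ B(\alpha\mathbf{u}_i + \beta\mathbf{v}_1) = \lambda_i(\alpha\mathbf{u}_i + \beta\mathbf{v}_1) - \lambda_1\mathbf{v}_1\mathbf{x}^T(\alpha\mathbf{u}_i + \beta\mathbf{v}_1). \]

Clearing parentheses and using the relations $\mathbf{x}^T\mathbf{v}_1 = 1$, $B\mathbf{v}_1 = \mathbf{0}$, and $B\mathbf{u}_i = \lambda_i\mathbf{u}_i$
leads to
\[ \alpha\lambda_i\mathbf{u}_i = \alpha\lambda_i\mathbf{u}_i + \lambda_i\beta\mathbf{v}_1 - \alpha\lambda_1(\mathbf{x}^T\mathbf{u}_i)\mathbf{v}_1 - \lambda_1\beta\mathbf{v}_1, \]

or
\[ \mathbf{0} = \left[\beta(\lambda_i - \lambda_1) - \alpha\lambda_1(\mathbf{x}^T\mathbf{u}_i)\right]\mathbf{v}_1. \tag{3} \]

One solution of this equation is
\[ \alpha = \lambda_i - \lambda_1 \quad\text{and}\quad \beta = \lambda_1(\mathbf{x}^T\mathbf{u}_i), \]

which yields
\[ \mathbf{v}_i = (\lambda_i - \lambda_1)\mathbf{u}_i + \lambda_1(\mathbf{x}^T\mathbf{u}_i)\mathbf{v}_1. \tag{4} \]

Any other solution to equation (3) will produce a multiple of the eigenvector given
by this last equation.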
We can summarize these results into the following theorem: 




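Theorem. Let $A$ be an $n \times n$ matrix with eigenvalues $\lambda_1, \lambda_2, \lambda_3, \ldots, \lambda_n$ and
associated eigenvectors $\mathbf{v}_1, \mathbf{v}_2, \mathbf{v}_3, \ldots, \mathbf{v}_n$, and let $\mathbf{x}$ be any $n$-vector for which
$\mathbf{x}^T\mathbf{v}_1 = 1$. Then the matrix

\[ B = A - \lambda_1\mathbf{v}_1\mathbf{x}^T \]

has eigenvalues $0, \lambda_2, \lambda_3, \ldots, \lambda_n$ with associated eigenvectors $\mathbf{v}_1, \mathbf{u}_2, \mathbf{u}_3, \ldots,
\mathbf{u}_n$, where for $i = 2, 3, 4, \ldots, n$

\[ \mathbf{v}_i = (\lambda_i - \lambda_1)\mathbf{u}_i + \lambda_1(\mathbf{x}^T\mathbf{u}_i)\mathbf{v}_1. \]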

Wielandt Deflation 


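In Wielandt Deflation, the deflation vector $\mathbf{x}$ is chosen to be

\[
\mathbf{x} = \frac{1}{\lambda_1 v_{1,k}}
\begin{bmatrix} a_{k1} \\ a_{k2} \\ a_{k3} \\ \vdots \\ a_{kn} \end{bmatrix},
\]

where $v_{1,k}$ denotes the $k$th element of the vector $\mathbf{v}_1$. The values $a_{k1}, a_{k2}, a_{k3}, \ldots,
a_{kn}$ correspond to the $k$th row of the matrix $A$ written as a column vector. The
value of $k$ can be any index for which $v_{1,k}$ is nonzero, but we will consistently choose
the smallest index for which $|v_{1,k}|$ is equal to the infinity norm of the vector $\mathbf{v}_1$.
With this choice for $\mathbf{x}$,

\[
\begin{aligned}
\mathbf{x}^T\mathbf{v}_1 &= \frac{1}{\lambda_1 v_{1,k}}\,[\text{$k$th row of } A]\,\mathbf{v}_1 \\
&= \frac{1}{\lambda_1 v_{1,k}}\,[\text{$k$th element of the product } A\mathbf{v}_1] \\
&= \frac{1}{\lambda_1 v_{1,k}}\,[\text{$k$th element of } \lambda_1\mathbf{v}_1] \\
&= \frac{\lambda_1 v_{1,k}}{\lambda_1 v_{1,k}} = 1.
\end{aligned}
\]
Therefore, the hypothesis of the deflation theorem is satisfied.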

We get an extra bonus with the Wielandt deflation vector. Each row of the
matrix $\lambda_1\mathbf{v}_1\mathbf{x}^T$ is a multiple of the $k$th row of $A$. In particular, the $i$th row of
$\lambda_1\mathbf{v}_1\mathbf{x}^T$ is $v_{1,i}/v_{1,k}$ times the $k$th row of $A$. This implies that the $k$th row of
$B = A - \lambda_1\mathbf{v}_1\mathbf{x}^T$ consists entirely of zeros. Suppose that $\mathbf{u}$ is an eigenvector of $B$
associated with the eigenvalue $\lambda \ne 0$. Given that $B$ has all zeros along the $k$th row,
the $k$th element of the product $B\mathbf{u}$, which is just $\lambda u_k$, must be zero, and therefore
$u_k = 0$. This, in turn, implies that the $k$th column of $B$ has no influence on the
product $B\mathbf{u}$. Thus, before searching for the next eigenpair, we can reduce the size
of $B$ by deleting the $k$th row and the $k$th column.

When we take advantage of this reduction in size, an extra detail must be
accounted for when equation (4) is applied to convert the eigenvectors of $B$ into
the eigenvectors of $A$. The vector $\mathbf{u}$ that appears on the right-hand side of the
equation will be one element smaller than the other vectors. To compensate for the
size difference, a zero must be placed between the $(k-1)$st and $k$th elements of $\mathbf{u}$
before equation (4) is used.


EXAMPLE 4.6 Wielandt Deflation in Action 
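Consider the 4 x 4 matrix 

$$A = \begin{bmatrix} 11 & -6 & 4 & -2 \\ 4 & 1 & 0 & 0 \\ -9 & 9 & -6 & 5 \\ -6 & 6 & -6 & 7 \end{bmatrix}.$$

Let's determine the two largest eigenvalues and associated eigenvectors of $A$. 
Applying the power method, we find the dominant eigenvalue of $A$ to be 
$\lambda_1 = 5$, with corresponding eigenvector $v_1 = \begin{bmatrix} 1 & 1 & 0 & 0 \end{bmatrix}^T$. In order to focus 
on the deflation process, we will ignore the effects of roundoff error in this example. 
With $k = 1$, the deflation vector is 

$$x = \frac{1}{\lambda_1 v_{1,1}} \left[ \text{first row of } A \right]^T = \frac{1}{5} \begin{bmatrix} 11 & -6 & 4 & -2 \end{bmatrix}^T.$$

Forming the matrix $\lambda_1 v_1 x^T$ and subtracting the result from $A$ gives the matrix $B$: 

$$B = \begin{bmatrix} 11 & -6 & 4 & -2 \\ 4 & 1 & 0 & 0 \\ -9 & 9 & -6 & 5 \\ -6 & 6 & -6 & 7 \end{bmatrix} - \begin{bmatrix} 11 & -6 & 4 & -2 \\ 11 & -6 & 4 & -2 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix} = \begin{bmatrix} 0 & 0 & 0 & 0 \\ -7 & 7 & -4 & 2 \\ -9 & 9 & -6 & 5 \\ -6 & 6 & -6 & 7 \end{bmatrix}.$$

Deleting the first row and first column from this matrix produces the matrix 

$$B' = \begin{bmatrix} 7 & -4 & 2 \\ 9 & -6 & 5 \\ 6 & -6 & 7 \end{bmatrix}.$$

Applying the power method to $B'$ generates $\lambda_2 = 4$ and $u_2' = \begin{bmatrix} 0 & 1/2 & 1 \end{bmatrix}^T$. 

To complete the deflation process, we must convert the eigenvector $u_2'$ to 
correspond to the original matrix. Since $k = 1$, we first prepend a zero to $u_2'$ to 
create the 4-vector $u_2$ which appears on the right-hand side of equation (4). Using 
equation (4), we then obtain 

$$\begin{aligned} v_2 &= (\lambda_2 - \lambda_1) u_2 + \lambda_1 (x^T u_2) v_1 \\ &= (4 - 5) \begin{bmatrix} 0 \\ 0 \\ 1/2 \\ 1 \end{bmatrix} + 5 \cdot \frac{1}{5} \left( \begin{bmatrix} 11 & -6 & 4 & -2 \end{bmatrix} \begin{bmatrix} 0 \\ 0 \\ 1/2 \\ 1 \end{bmatrix} \right) \begin{bmatrix} 1 \\ 1 \\ 0 \\ 0 \end{bmatrix} \\ &= \begin{bmatrix} 0 \\ 0 \\ -1/2 \\ -1 \end{bmatrix}. \end{aligned}$$

We therefore have the eigenpairs 
$\left(5, \begin{bmatrix} 1 & 1 & 0 & 0 \end{bmatrix}^T\right)$ and $\left(4, \begin{bmatrix} 0 & 0 & -1/2 & -1 \end{bmatrix}^T\right)$. 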
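The arithmetic of this example can be checked with a short script. The following is a minimal sketch, not part of the original text: it uses exact rational arithmetic from the Python standard library, takes the eigenpairs quoted in the example as given rather than running the power method, and uses zero-based indexing, so k = 0 here corresponds to k = 1 in the text. The helper name `matvec` is an illustrative choice.

```python
from fractions import Fraction

def matvec(M, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * b for a, b in zip(row, v)) for row in M]

A = [[Fraction(c) for c in row] for row in
     [[11, -6, 4, -2], [4, 1, 0, 0], [-9, 9, -6, 5], [-6, 6, -6, 7]]]
lam1 = Fraction(5)
v1 = [Fraction(1), Fraction(1), Fraction(0), Fraction(0)]
k = 0  # zero-based index of the row used to build the deflation vector
n = len(A)

# Deflation vector: x = (row k of A) / (lam1 * v1[k])
x = [a / (lam1 * v1[k]) for a in A[k]]

# Deflated matrix: B = A - lam1 * v1 * x^T
B = [[A[i][j] - lam1 * v1[i] * x[j] for j in range(n)] for i in range(n)]

# Reduce the size of B by deleting row k and column k
B_reduced = [[B[i][j] for j in range(n) if j != k] for i in range(n) if i != k]

# Eigenpair of the reduced matrix, as quoted in the example
lam2 = Fraction(4)
u2_reduced = [Fraction(0), Fraction(1, 2), Fraction(1)]

# Re-insert a zero in position k, then apply equation (4)
u2 = u2_reduced[:k] + [Fraction(0)] + u2_reduced[k:]
x_dot_u2 = sum(a * b for a, b in zip(x, u2))
v2 = [(lam2 - lam1) * u + lam1 * x_dot_u2 * v for u, v in zip(u2, v1)]

print(B_reduced)
print(v2)
```

Because every quantity is a `Fraction`, the check that $Av_2 = 4v_2$ is exact and does not depend on a floating-point tolerance.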


Variation for Symmetric Matrices 


Hotelling deflation is specifically designed for symmetric matrices. Therefore, let 
$A$ be a symmetric $n \times n$ matrix with eigenvalues $\lambda_1, \lambda_2, \lambda_3, \ldots, \lambda_n$ and associated 
orthogonal eigenvectors $v_1, v_2, v_3, \ldots, v_n$. Suppose that by some means (e.g., 
the power method or the inverse power method) we have obtained approximations 
for $\lambda_1$ and $v_1$, and consider the matrix 

$$B = A - \frac{\lambda_1}{v_1^T v_1} v_1 v_1^T. \qquad (5)$$

To begin, note that 

$$B^T = A^T - \frac{\lambda_1}{v_1^T v_1} \left( v_1 v_1^T \right)^T = A - \frac{\lambda_1}{v_1^T v_1} v_1 v_1^T = B,$$

so $B$ is a symmetric matrix. Next, by direct calculation, we find 

$$B v_1 = A v_1 - \frac{\lambda_1}{v_1^T v_1} v_1 v_1^T v_1 = \lambda_1 v_1 - \lambda_1 v_1 = 0,$$

so that 0 is an eigenvalue of $B$ with associated eigenvector $v_1$. Furthermore, for 
$i = 2, 3, 4, \ldots, n$, 

$$B v_i = A v_i - \frac{\lambda_1}{v_1^T v_1} v_1 v_1^T v_i = \lambda_i v_i - 0 = \lambda_i v_i,$$

where, in going from the first expression to the second, we have used the orthogonality of 
the eigenvectors of a symmetric matrix. Thus the transformation given by equation (5) 
shifts the eigenvalue $\lambda_1$ to zero, but preserves every other eigenvalue and 
every eigenvector of the matrix $A$. 
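To see equation (5) at work on the smallest possible case, the sketch below applies it to a 2 x 2 symmetric matrix whose eigenpairs are known in closed form. The matrix, the function name `hotelling_deflate` and the use of exact fractions are illustrative assumptions, not taken from the text.

```python
from fractions import Fraction

def hotelling_deflate(A, lam, v):
    """Return B = A - (lam / (v^T v)) v v^T, as in equation (5)."""
    n = len(A)
    vtv = sum(c * c for c in v)
    return [[A[i][j] - lam * v[i] * v[j] / vtv for j in range(n)]
            for i in range(n)]

def matvec(M, x):
    return [sum(a * b for a, b in zip(row, x)) for row in M]

# Symmetric test matrix with eigenpairs (3, [1, 1]) and (1, [1, -1])
A = [[Fraction(2), Fraction(1)], [Fraction(1), Fraction(2)]]
lam1, v1 = Fraction(3), [Fraction(1), Fraction(1)]
lam2, v2 = Fraction(1), [Fraction(1), Fraction(-1)]

B = hotelling_deflate(A, lam1, v1)

print(matvec(B, v1))  # v1 is now associated with the eigenvalue 0
print(matvec(B, v2))  # v2 keeps its eigenvalue lam2
```

Note that $v_1$ does not have to be normalized, because the factor $v_1^T v_1$ in the denominator of equation (5) accounts for its length.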


EXAMPLE 4.7 Hotelling Deflation in Action 


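Consider the symmetric 4 x 4 matrix 

$$A = \begin{bmatrix} 4 & 2/3 & -4/3 & 4/3 \\ 2/3 & 4 & 0 & 0 \\ -4/3 & 0 & 6 & 2 \\ 4/3 & 0 & 2 & 6 \end{bmatrix}.$$

We will now determine the eigenpairs with the two largest eigenvalues using Hotelling 
deflation. 

The power method for symmetric matrices applied to $A$ finds the dominant 
eigenvalue to be $\lambda_1 = 8$ with associated eigenvector, normalized in the Euclidean 
norm, $v_1 = \begin{bmatrix} 0 & 0 & \sqrt{2}/2 & \sqrt{2}/2 \end{bmatrix}^T$. Using this eigenpair, we compute 

$$\frac{\lambda_1}{v_1^T v_1} v_1 v_1^T = \frac{8}{1} \begin{bmatrix} 0 \\ 0 \\ \sqrt{2}/2 \\ \sqrt{2}/2 \end{bmatrix} \begin{bmatrix} 0 & 0 & \sqrt{2}/2 & \sqrt{2}/2 \end{bmatrix} = \begin{bmatrix} 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 4 & 4 \\ 0 & 0 & 4 & 4 \end{bmatrix}$$

and then 

$$B = \begin{bmatrix} 4 & 2/3 & -4/3 & 4/3 \\ 2/3 & 4 & 0 & 0 \\ -4/3 & 0 & 2 & -2 \\ 4/3 & 0 & -2 & 2 \end{bmatrix}.$$

The power method for symmetric matrices applied to $B$ then gives $\lambda_2 = 6$ and 
$v_2 = \begin{bmatrix} \sqrt{3}/2 & \sqrt{3}/6 & -\sqrt{3}/3 & \sqrt{3}/3 \end{bmatrix}^T$. 


Application Problem: Measuring the Student Experience 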


In the Chapter 4 Overview (see page 261), the following matrix of correlations 
among seven measures of the “student experience” for the four-year colleges and 
universities in the Commonwealth of Virginia was presented: 


$$R = \begin{bmatrix}
1.0000 & -0.2411 & 0.4931 & 0.3009 & -0.6865 & 0.9493 & 0.7538 \\
-0.2411 & 1.0000 & -0.5535 & -0.0387 & 0.1256 & -0.1698 & 0.0684 \\
0.4931 & -0.5535 & 1.0000 & -0.2095 & -0.1546 & 0.3972 & -0.0643 \\
0.3009 & -0.0387 & -0.2095 & 1.0000 & -0.2357 & 0.3994 & 0.4033 \\
-0.6865 & 0.1256 & -0.1546 & -0.2357 & 1.0000 & -0.7761 & -0.7330 \\
0.9493 & -0.1698 & 0.3972 & 0.3994 & -0.7761 & 1.0000 & 0.7601 \\
0.7538 & 0.0684 & -0.0643 & 0.4033 & -0.7330 & 0.7601 & 1.0000
\end{bmatrix}$$


The measures are, in order, the percentage of first year students who return for their 
second year, the percentage of classes with fewer than 20 students, the percentage 
of classes with more than 50 students, the percentage of classes taught by full-time 
faculty, the average number of years needed to graduate in the current graduating 
class, the percentage of first-time full-time students who graduate within six years 
and the donation rate for alumni. 

Recall that the eigenvectors of R represent uncorrelated linear combinations 
of the original variables known as principal components. These principal compo- 
nents are ranked according to the eigenvalues of R. In particular, the eigenvector 
associated with the largest eigenvalue of R is called the first principal component, 
the eigenvector associated with the next largest eigenvalue is called the second prin- 
cipal component, and so on. Further, the percentage of variation accounted for by 
each principal component is given by the ratio of the associated eigenvalue to the 
number of variables. 

Table 4.3 displays the first two principal components for the matrix R. These 
were obtained by using the power method for symmetric matrices and Hotelling 
deflation. Observe that these two principal components account for more than 75% 
of the variation in the original data. The largest entries in the first principal com- 
ponent correspond to students returning for the second year, years to graduation, 
graduation rate and alumni donation rate. This component may therefore be in- 
terpreted as a measure of the overall college experience. On the other hand, the 
largest entries in the second principal component primarily correspond to classroom 
data. This component may therefore be interpreted as a measure of the classroom 
experiences of students. 
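The cumulative percentages reported in Table 4.3 follow from the rule stated earlier: each principal component accounts for a fraction of the variation equal to its eigenvalue divided by the number of variables. A minimal sketch of that arithmetic, using the two eigenvalues from Table 4.3 and the seven variables of this problem (the variable names are illustrative):

```python
# Eigenvalues of the first two principal components (Table 4.3)
eigenvalues = [3.6403, 1.6759]
num_variables = 7  # R is a 7 x 7 correlation matrix

# Percentage of variation for each component: 100 * eigenvalue / number of variables
shares = [100.0 * lam / num_variables for lam in eigenvalues]

# Running total gives the cumulative percentage of variation
cumulative = []
total = 0.0
for share in shares:
    total += share
    cumulative.append(total)

for lam, share, cum in zip(eigenvalues, shares, cumulative):
    print(f"eigenvalue {lam:.4f}: {share:.2f}% of variation, {cum:.2f}% cumulative")
```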


Repeated Deflation 


The process of spectrum deflation, be it Wielandt deflation or Hotelling deflation, 
can be repeated as each new eigenvalue-eigenvector pair is determined. Combining 
repeated deflation with the power method can be used to determine the first few 
largest eigenvalues of a matrix; alternatively, repeated deflation combined with the 
inverse power method can be used to determine the first few smallest eigenvalues. 
In principle, deflation could be used to compute all of the eigenvalues of a matrix, 
but the accumulation of roundoff error makes such a scheme impractical even for 
matrices of only moderate size. 


                                          First         Second 
                                          Component     Component 
Students returning for second year         0.4971       -0.0994 
Classes with fewer than 20 students       -0.1318        0.5731 
Classes with more than 50 students         0.1942       -0.6610 
Classes taught by full-time faculty        0.2265        0.3422 
Time to graduation                        -0.4444       -0.1027 
Graduation within six years                0.5073       -0.0072 
Alumni donation rate                       0.4378        0.3115 

Eigenvalue                                 3.6403        1.6759 
Cumulative % Variation                     52.00         75.95 

TABLE 4.3: Principal Components for Student Experience 
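Repeated deflation, as described in this subsection, amounts to a short loop: find the dominant eigenpair, deflate, and repeat. The sketch below combines a basic power method with Hotelling deflation and applies the loop to the symmetric matrix of Example 4.7. It is an illustration under stated assumptions, not the text's algorithm: the starting vector of all ones, the fixed iteration count, the Rayleigh quotient as the eigenvalue estimate, and all function names are choices made here.

```python
import math

def matvec(M, v):
    return [sum(a * b for a, b in zip(row, v)) for row in M]

def power_method(M, iterations=500):
    """Basic power method with Euclidean normalization (symmetric M assumed)."""
    v = [1.0] * len(M)  # assumed starting vector
    for _ in range(iterations):
        w = matvec(M, v)
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    # Rayleigh quotient; v has unit length, so no division is needed
    lam = sum(a * b for a, b in zip(v, matvec(M, v)))
    return lam, v

def hotelling_deflate(M, lam, v):
    """B = M - (lam / (v^T v)) v v^T, as in equation (5)."""
    n = len(M)
    vtv = sum(c * c for c in v)
    return [[M[i][j] - lam * v[i] * v[j] / vtv for j in range(n)]
            for i in range(n)]

def leading_eigenpairs(M, count):
    """Find `count` dominant eigenpairs by repeated deflation."""
    pairs = []
    for _ in range(count):
        lam, v = power_method(M)
        pairs.append((lam, v))
        M = hotelling_deflate(M, lam, v)
    return pairs

A = [[4, 2/3, -4/3, 4/3],
     [2/3, 4, 0, 0],
     [-4/3, 0, 6, 2],
     [4/3, 0, 2, 6]]

pairs = leading_eigenpairs(A, 2)
for lam, v in pairs:
    print(round(lam, 6), [round(c, 4) for c in v])
```

Each deflation step uses an eigenpair that is itself only approximate, which is the source of the error accumulation noted above.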
EXERCISES 


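1. For each of the following matrices, an eigenvalue-eigenvector pair is given. Determine 
the deflation vector $x$ and the deflated matrix $B$ corresponding to Wielandt 
deflation. 

(a) $A = \begin{bmatrix} 3 & 0 & 0 \\ 0 & 2 & 2 \\ 0 & 1 & 3 \end{bmatrix}$, $\lambda_1 = 4$, $v_1 = \begin{bmatrix} 0 & 1 & 1 \end{bmatrix}^T$ 

(b) $A = \begin{bmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 4 & -17 & 8 \end{bmatrix}$, $\lambda_1 = 4$, $v_1 = \begin{bmatrix} 1/16 & 1/4 & 1 \end{bmatrix}^T$ 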

ig 4 

4 


HOw, 
a 
t 
ex 
2. For each of the following matrices, an eigenvalue-eigenvector pair is given. Determine 
the deflated matrix $B$ corresponding to Hotelling deflation. 

(a) $A = \begin{bmatrix} 3 & 0 & 0 \\ 0 & 2 & 2 \\ 0 & 2 & 5 \end{bmatrix}$, $\lambda_1 = 6$, $v_1 = \begin{bmatrix} 0 & 1 & 2 \end{bmatrix}^T$ 
f 4 -1 1] 1 
(by AS Weal BAO, Ae eee eee |e 
L 1 -2 -8 1 
fF -2 0 36 3/5 
(c) A= 0 -3 0 » AL=-50, wi=] 0 
l -36 0 ~23 4/5 
fF 6 -1 -1 -l 0 
col eat» SOS eam “sot S Pipe 
(d) A= ee ee ee ee Ay = 11, v= 
-1 -1 -1 10 1 
3. Given the following information, use equation (4) to construct the eigenvector 
v2. 
l 6 0 
(a) ub = | : i k=3, x=3| -6|, m=| 1/2], =4, 2=3 
7 1 
7 1 H 
F «| 6 i 7 
(b) uy = -1 ghd x= 4 » VW= 0 ,AL=5, 2=3 
: 2 0 
1 4 1 
@ w=| 4], e=1 Tae v= 4 , 1 =6, A =3 


For Exercises 4-9, determine the first two dominant eigenvalues and associated eigen- 
vectors for the specified matrix. 


—101 ol -12 0 0 
—174 88 —20 0 0 
4, 136 —68 19 0 0 
840 -420 105 -32 1 
2106 -1008 252 —Sd 4 


7 2 10 21 
a ore ae 

Beall ie ei Or, Sa <0 
2 0 0 9 1 
Se oa 
Ath eae ee 
2a) Be i 4 

Be laa GF 3; 10 
ok a Oe 
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02 
6 
15 


1.00 
0.01 
9. | 0.97 
0.44 
0.02 


In Exercises 10-15, a correlation matrix for a set of measured variables is given. Deter- 
mine the first two principal components and the percentage of variation accounted for 




6 15 1 5 3 5 
16 0 -13 4 -9 ~2 
0 58 ~3 8 1 6 
-13 -3 30 1 -—7 ~3 
4 8 1 42 3 5 
~9 1 -7 3 28 -1 
—-2 6 -3 5 -1 44 
1 1 0 3 

3.0 1 =O 

0 -6 2 1 

2 1 9 O 

0 1 0 ~2 

0.01 0.97 0.44 0.02 
1.00 0.15 0.69 0.86 
0.15 1.00 0.51 0.12 
0.69 0.51 1.00 0.78 


0.86 0.12 0.78 1.00 


by those components. Provide an interpretation for each principal component. 


10. Dunteman (Principal Components Analysis, Sage University Press Series on 
Quantitative Applications in the Social Sciences, 07-069, Sage Publications, Beverly 
Hills, 1989) presents the following matrix related to the satisfaction of married 
army enlisted personnel. The variables are satisfaction with job, job training, 
working conditions, medical care, and dental care. 

$$\begin{bmatrix}
1.000 & 0.451 & 0.511 & 0.197 & 0.162 \\
0.451 & 1.000 & 0.445 & 0.252 & 0.238 \\
0.511 & 0.445 & 1.000 & 0.301 & 0.227 \\
0.197 & 0.252 & 0.301 & 1.000 & 0.620 \\
0.162 & 0.238 & 0.227 & 0.620 & 1.000
\end{bmatrix}$$

11. Harman (Modern Factor Analysis, The University of Chicago Press, Chicago, 
1960) summarizes the correlation among eight measured variables for 305 female 
subjects. The measured variables are, in order, height, arm span, forearm length, 
lower leg length, weight, pelvic breadth, chest girth and chest width. 

$$\begin{bmatrix}
1.000 & 0.846 & 0.805 & 0.859 & 0.473 & 0.398 & 0.301 & 0.382 \\
0.846 & 1.000 & 0.881 & 0.826 & 0.376 & 0.326 & 0.277 & 0.415 \\
0.805 & 0.881 & 1.000 & 0.801 & 0.380 & 0.319 & 0.237 & 0.345 \\
0.859 & 0.826 & 0.801 & 1.000 & 0.436 & 0.329 & 0.327 & 0.365 \\
0.473 & 0.376 & 0.380 & 0.436 & 1.000 & 0.762 & 0.730 & 0.629 \\
0.398 & 0.326 & 0.319 & 0.329 & 0.762 & 1.000 & 0.583 & 0.577 \\
0.301 & 0.277 & 0.237 & 0.327 & 0.730 & 0.583 & 1.000 & 0.539 \\
0.382 & 0.415 & 0.345 & 0.365 & 0.629 & 0.577 & 0.539 & 1.000
\end{bmatrix}$$

12. Using data from a study of Olympic track records (D. N. Naik and R. Khattree, 
“Revisiting Olympic Track Records: Some Practical Considerations in the Principal 
Components Analysis,” The American Statistician, 50, 140-144, 1996), 
the following matrix summarizes the correlations among speeds for the 100 m, 
200 m, 400 m, 800 m, 1500 m, 5000 m, 10000 m, and marathon events. 

$$\begin{bmatrix}
1.000 & 0.910 & 0.829 & 0.751 & 0.692 & 0.600 & 0.610 & 0.499 \\
0.910 & 1.000 & 0.848 & 0.803 & 0.771 & 0.686 & 0.688 & 0.588 \\
0.829 & 0.848 & 1.000 & 0.872 & 0.831 & 0.767 & 0.780 & 0.703 \\
0.751 & 0.803 & 0.872 & 1.000 & 0.907 & 0.851 & 0.856 & 0.796 \\
0.692 & 0.771 & 0.831 & 0.907 & 1.000 & 0.924 & 0.931 & 0.860 \\
0.600 & 0.686 & 0.767 & 0.851 & 0.924 & 1.000 & 0.970 & 0.927 \\
0.610 & 0.688 & 0.780 & 0.856 & 0.931 & 0.970 & 1.000 & 0.942 \\
0.499 & 0.588 & 0.703 & 0.796 & 0.860 & 0.927 & 0.942 & 1.000
\end{bmatrix}$$

13. In a study of vertical jump ability (I. Kollias, V. Hatzitaki, G. Papalakovou, 
and G. Giatsis, “Using Principal Components Analysis to Identify Individual 
Differences in Vertical Jump Performance,” Research Quarterly for Exercise 
and Sport, 72, 63-66, 2001), the following correlation matrix is presented. The 
measured variables are peak force relative to body mass, peak power relative to 
body mass, maximum rate of force development, time to peak force, push-off 
duration and vertical displacement of center of mass. 

$$\begin{bmatrix}
1.000 & 0.897 & 0.578 & -0.381 & -0.483 & -0.226 \\
0.897 & 1.000 & 0.494 & -0.237 & -0.391 & -0.054 \\
0.578 & 0.494 & 1.000 & -0.650 & -0.589 & -0.088 \\
-0.381 & -0.237 & -0.605 & 1.000 & 0.915 & 0.087 \\
-0.483 & -0.391 & -0.589 & 0.915 & 1.000 & 0.141 \\
-0.226 & -0.054 & -0.088 & 0.087 & 0.141 & 1.000
\end{bmatrix}$$

14. In a study of the effect of pigment levels on photosynthesis, Cassie (“Relationship 
Between Plant Pigments and Gross Primary Production in Skeletonema costatum,” 
Limnology and Oceanography, 8, 433-439, 1963) reports the following 
correlation matrix for chlorophyll a, chlorophyll c, carotenoids, cell count and 
gross production. 

$$\begin{bmatrix}
1.000 & 0.989 & 0.892 & 0.972 & 0.788 \\
0.989 & 1.000 & 0.855 & 0.958 & 0.694 \\
0.892 & 0.855 & 1.000 & 0.874 & 0.808 \\
0.972 & 0.958 & 0.874 & 1.000 & 0.770 \\
0.788 & 0.694 & 0.808 & 0.770 & 1.000
\end{bmatrix}$$

15. Johnson and Wichern (Applied Multivariate Statistical Analysis, Prentice Hall, 
Englewood Cliffs, 1982) report the following correlations among the weekly rates 
of return for three chemical stocks (Allied Chemical, DuPont, Union Carbide) 
and two oil stocks (Exxon and Texaco). 

$$\begin{bmatrix}
1.000 & 0.577 & 0.509 & 0.387 & 0.462 \\
0.577 & 1.000 & 0.599 & 0.389 & 0.322 \\
0.509 & 0.599 & 1.000 & 0.436 & 0.426 \\
0.387 & 0.389 & 0.436 & 1.000 & 0.523 \\
0.462 & 0.322 & 0.426 & 0.523 & 1.000
\end{bmatrix}$$
16. Let $A$ be an $n \times n$ matrix. Suppose the eigenvalues of $A$ satisfy the relations 
$\lambda_1 > \lambda_2 > \lambda_3 > \cdots > \lambda_{n-1} > \lambda_n$. 
Note the absence of absolute values. Devise an algorithm to determine $\lambda_2$. On 
what assumptions is your algorithm based? 

17. Using the algorithm from Exercise 16, determine $\lambda_2$ for the following matrices: 
7 2 1 0 1 
1 -8 #1 2 —1 
(a) A=] 0 -1 0 -1 9 
2 0 0 9 1 
-1 0 -1 0 
(b) 
16-8 2 1 
5 |) Be eT ay 8 
(b) A= —] 1 -4 1] 
0 -1 2 8 
rF&8 20 1 
iw a oe 
te: PST ogra 
[okt the Shy! 228 
20 3 -1 #1 
3 7 -2 2 
CE AS atte ip, eee 
1 2 1 -8 

18. The second largest eigenvalue, $\lambda_2$, of the adjacency matrix associated with a 
graph can be related to what are known as the diameter and mean distance of 
the graph. See Mohar and Poljak [1] for details. Determine the second largest 
eigenvalue of the adjacency matrices associated with the graphs in Figures 4.2 
and 4.3. 

19. Let $A$ be an $n \times n$ matrix with eigenvalues $\lambda_1, \lambda_2, \lambda_3, \ldots, \lambda_n$ and associated 
eigenvectors $v_1, v_2, v_3, \ldots, v_n$. Further, let $w_1$ be the eigenvector of $A^T$ associated 
with $\lambda_1$ for which $w_1^T v_1 = 1$. Consider the matrix $B = A - \lambda_1 v_1 w_1^T$. 
(a) Show that $B$ has the same eigenvectors as $A$. 
(b) Show that $B$ has eigenvalues $0, \lambda_2, \lambda_3, \ldots, \lambda_n$. 

20. Construct an algorithm to determine the first two dominant eigenvalues of a 
matrix $A$ using the deflation strategy of Exercise 19. 

21. Repeat Exercises 4-9 using the algorithm developed in Exercise 20. 


4.4 REDUCTION TO SYMMETRIC TRIDIAGONAL FORM 


In the final two sections of this chapter we will address the issue of computing all 
of the eigenvalues of a matrix. Because the eigenvalues of symmetric matrices are 
well-conditioned whereas the eigenvalues of nonsymmetric matrices can be poorly 
conditioned and because an n x n symmetric matrix always possesses n linearly 
independent eigenvectors whereas a nonsymmetric matrix may not, we will restrict 
our attention to symmetric matrices only. For discussions and detailed algorithms 
related to the nonsymmetric eigenvalue problem, we refer the interested reader to 
the classic works of Wilkinson [1] and Wilkinson and Reinsch [2], as well as the 
more recent texts by Golub and van Loan [3] and Saad [4]. 

To compute all of the eigenvalues of a symmetric matrix, we will proceed in 
two stages. First, the matrix will be transformed to symmetric tridiagonal form. 
This stage requires a fixed, finite number of operations (i.e., the procedure is not 
iterative). In the second stage, an iterative procedure is applied to the symmetric 
tridiagonal matrix produced by the first stage. The iteration generates a sequence of 
matrices which will converge to a diagonal matrix. The eigenvalues of this diagonal 
matrix are, of course, just the elements along the main diagonal (recall Exercise 11 
of Section 3.3). Taking into account the cumulative effect upon the eigenvalues of 
all of the transformations which have been applied to the original matrix yields the 
eigenvalues of the original matrix. 

So why do we proceed in two stages? Why don’t we just perform the iterative 
technique on the original matrix? Simply put, the answer is efficiency. Transforming 
an $n \times n$ symmetric matrix to symmetric tridiagonal form requires on the order of 
$\frac{4}{3}n^3$ arithmetic operations for large $n$. The iterative reduction of the symmetric 
tridiagonal matrix to diagonal form then requires $O(n^2)$ arithmetic operations. 
On the other hand, applying the iterative technique directly to the original matrix 
requires on the order of $\frac{4}{3}n^3$ arithmetic operations per iteration. Thus, by first 
transforming the matrix to a simpler form we significantly reduce the computational 
cost. 

We will focus on the reduction to symmetric tridiagonal form in this section. 
The second stage will be considered in the next section. To develop the reduction 
procedure, two special tools are needed: similarity transformations and orthogonal 
matrices. We will therefore start with a discussion of these items. Next an algorithm 
for reducing a symmetric matrix to symmetric tridiagonal form will be developed. 
Finally, we will discuss the steps needed to obtain the eigenvectors of a symmetric 
matrix. 


Similarity Transformations and Orthogonal Matrices 


Transforming a symmetric matrix to symmetric tridiagonal form is meaningless un- 
less we have precise knowledge of how the eigenvalues have been affected by each 
transformation that has been performed. Fortunately, there is a class of transfor- 
mations which does not change the spectrum of a matrix. These are known as 
similarity transformations. 


Definition. Let $A$ be an $n \times n$ matrix and let $M$ be any nonsingular $n \times n$ 
matrix. The matrix $B = M^{-1}AM$ is said to be SIMILAR to $A$. The process 
of converting $A$ to $B$ is called a SIMILARITY TRANSFORMATION. 


To establish that a similarity transformation does not affect any of the eigenvalues 
of $A$, we proceed as follows. The eigenvalues of $B$ are solutions of the 
equation $\det(B - \lambda I) = 0$; but 

$$\begin{aligned}
\det(B - \lambda I) &= \det(M^{-1}AM - \lambda I) \\
&= \det\left[ M^{-1}(A - \lambda I)M \right] \\
&= \det(M^{-1}) \det(A - \lambda I) \det(M) \\
&= (\det(M))^{-1} \det(A - \lambda I) \det(M) \\
&= \det(A - \lambda I).
\end{aligned}$$

Thus, $\det(B - \lambda I) = 0$ if and only if $\det(A - \lambda I) = 0$, which implies that $A$ and $B$ 
have exactly the same eigenvalues. 
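A small numerical check makes this result concrete. For a 2 x 2 matrix the characteristic polynomial is $\lambda^2 - \mathrm{tr}(A)\lambda + \det(A)$, so two 2 x 2 matrices have the same eigenvalues exactly when their traces and determinants agree. The sketch below, which uses an arbitrarily chosen $A$ and nonsingular $M$ (both assumptions made here for illustration), confirms this for $B = M^{-1}AM$ in exact arithmetic.

```python
from fractions import Fraction

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def inverse_2x2(M):
    """Inverse of a nonsingular 2 x 2 matrix via the adjugate formula."""
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [[M[1][1] / det, -M[0][1] / det],
            [-M[1][0] / det, M[0][0] / det]]

def char_poly_2x2(M):
    """Coefficients (1, -trace, det) of lambda^2 - trace*lambda + det."""
    trace = M[0][0] + M[1][1]
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return (1, -trace, det)

A = [[Fraction(2), Fraction(1)], [Fraction(1), Fraction(3)]]
M = [[Fraction(1), Fraction(2)], [Fraction(3), Fraction(5)]]  # det(M) = -1

B = matmul(inverse_2x2(M), matmul(A, M))  # similarity transformation

print(B)
print(char_poly_2x2(A), char_poly_2x2(B))
```

The entries of $B$ look nothing like those of $A$, yet the two characteristic polynomials coincide.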

Although any nonsingular matrix can be used to generate a similarity transformation, 
we would like to use matrices whose inverses are easy to compute. The 
class of orthogonal matrices will suit our needs nicely. 


Definition. The $n \times n$ matrix $Q$ is called an ORTHOGONAL MATRIX if $Q^{-1} = Q^T$. 


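As an example, consider the matrix 

$$Q = \begin{bmatrix} 1 & 0 & 0 \\ 0 & \sqrt{2}/2 & -\sqrt{2}/2 \\ 0 & \sqrt{2}/2 & \sqrt{2}/2 \end{bmatrix}.$$

Direct multiplication shows that 

$$QQ^T = \begin{bmatrix} 1 & 0 & 0 \\ 0 & \sqrt{2}/2 & -\sqrt{2}/2 \\ 0 & \sqrt{2}/2 & \sqrt{2}/2 \end{bmatrix} \begin{bmatrix} 1 & 0 & 0 \\ 0 & \sqrt{2}/2 & \sqrt{2}/2 \\ 0 & -\sqrt{2}/2 & \sqrt{2}/2 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} = I$$

and that $Q^TQ = I$. Hence, $Q^{-1} = Q^T$ and $Q$ is an orthogonal matrix. 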

Aside from having an inverse matrix that is easy to compute, an orthogonal 
matrix has several other important properties, many of which will be treated in the 
exercises. The most important of these properties is related to the conditioning of 
the eigenvalue problem. As noted on several occasions, the eigenvalues of symmetric 
matrices are well conditioned. Suppose then that A is a symmetric matrix and 
$B = Q^{-1}AQ = Q^TAQ$ for some orthogonal matrix $Q$. It follows that 

$$B^T = (Q^TAQ)^T = Q^TA^TQ = Q^TAQ = B.$$


Hence, a similarity transformation with an orthogonal matrix maintains symmetry 
and therefore preserves the conditioning of the original eigenvalue problem. 

Another property of orthogonal matrices that we will find particularly useful 
is that multiplication by an orthogonal matrix does not change the Euclidean norm 
of a vector. For, if $Q$ is any orthogonal matrix and $x$ is any vector, then 

$$\|Qx\|_2 = \sqrt{(Qx)^TQx} = \sqrt{x^TQ^TQx} = \sqrt{x^Tx} = \|x\|_2,$$

since $Q^TQ = I$. 
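Both properties can be confirmed numerically for the 3 x 3 orthogonal matrix $Q$ given in the earlier example. In the sketch below the test vector $x$ and the helper names are arbitrary choices; floating-point results are compared with a tolerance rather than for exact equality.

```python
import math

s = math.sqrt(2) / 2
Q = [[1.0, 0.0, 0.0],
     [0.0, s, -s],
     [0.0, s, s]]

def matvec(M, v):
    return [sum(a * b for a, b in zip(row, v)) for row in M]

def norm2(v):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(c * c for c in v))

# Q Q^T: entry (i, j) is the dot product of rows i and j of Q
QQt = [[sum(Q[i][k] * Q[j][k] for k in range(3)) for j in range(3)]
       for i in range(3)]

x = [1.0, 2.0, 3.0]  # arbitrary test vector
Qx = matvec(Q, x)

print(QQt)
print(norm2(x), norm2(Qx))
```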
Figure 4.7 Reflection of a vector x across the hyperplane whose normal 
vector is w. 


Reducing a Symmetric Matrix to Tridiagonal Form 


There are several different algorithms available for reducing a symmetric matrix 
to tridiagonal form. Most work in a sequential manner, applying a succession 
of similarity transformations which gradually produce the desired form. These 
techniques differ only in the family of orthogonal matrices used to generate the 
similarity transformations. There are also direct “tridiagonalization” techniques 
that determine the single orthogonal matrix needed to reduce the original matrix 
to tridiagonal form. These techniques work in much the same way that the direct 
factorization techniques we developed in Section 3.5 compute the LU decomposition 
of a matrix. One such technique will be treated in the exercises. 

Here, we will restrict our attention to a reduction algorithm based on the use 
of Householder matrices. 


Definition. A HOUSEHOLDER MATRIX is any matrix of the form 
$$H = I - 2ww^T,$$
where $w$ is a column vector with $w^Tw = 1$. 


It is quite easy to show that Householder matrices are both symmetric and 
orthogonal; that is, $H^{-1} = H^T = H$ (Exercise 2). Geometrically, multiplication of a 
vector $x$ by the Householder matrix $H$ results in the reflection of $x$ across the 
hyperplane whose normal vector is $w$ (see Figure 4.7). 

In practice, the Householder matrices are not computed explicitly; only the 
vector $w$ is computed. For, once the vector $w$ is known, the similarity transformation 
$HAH$ is given by 

$$(I - 2ww^T)A(I - 2ww^T) = A - 2ww^TA - 2Aww^T + 4ww^TAww^T,$$

which is completely determined by $w$. The computation of $HAH$ can be simplified 
tremendously if we define $u = Aw$ and $K = w^Tu = w^TAw$. Then, since $A$ is symmetric 
and therefore $w^TA = u^T$, 

$$\begin{aligned}
HAH &= A - 2ww^TA - 2Aww^T + 4ww^TAww^T \\
&= A - 2wu^T - 2uw^T + 4Kww^T \\
&= A - 2w(u^T - Kw^T) - 2(u - Kw)w^T.
\end{aligned}$$

If we now let $q = u - Kw$, then $HAH = A - 2wq^T - 2qw^T$. 

Figure 4.8 Illustration of Householder reduction to symmetric tridiagonal 
form for a 5 x 5 matrix. Each x denotes an element that is not 
necessarily zero. 
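The rank-two update form of $HAH$ can be compared directly against the explicit triple product. The following sketch does so in exact arithmetic for an arbitrarily chosen symmetric matrix and the unit vector $w = [3/5,\ 4/5,\ 0]^T$; both are assumptions made for illustration only.

```python
from fractions import Fraction

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

A = [[Fraction(c) for c in row] for row in [[2, 1, 0], [1, 3, 1], [0, 1, 4]]]
w = [Fraction(3, 5), Fraction(4, 5), Fraction(0)]  # w^T w = 1
n = len(A)

# Explicit route: form H = I - 2 w w^T, then compute H A H
H = [[(1 if i == j else 0) - 2 * w[i] * w[j] for j in range(n)] for i in range(n)]
explicit = matmul(matmul(H, A), H)

# Update route: u = A w, K = w^T u, q = u - K w, then A - 2 w q^T - 2 q w^T
u = [sum(A[i][j] * w[j] for j in range(n)) for i in range(n)]
K = sum(wi * ui for wi, ui in zip(w, u))
q = [ui - K * wi for ui, wi in zip(u, w)]
update = [[A[i][j] - 2 * w[i] * q[j] - 2 * q[i] * w[j] for j in range(n)]
          for i in range(n)]

print(update == explicit)
```

The update route needs one matrix-vector product and a rank-two correction, whereas the explicit route needs two full matrix-matrix products.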

The algorithm to reduce a symmetric matrix to tridiagonal form using Householder 
matrices involves a sequence of $n - 2$ similarity transformations, as illustrated 
in Figure 4.8 for the case $n = 5$. The first Householder matrix, $H_1$, is selected so 
that $H_1A$ will have zeros in the first $n - 2$ rows of the $n$th column and the $n$th row 
of $A$ will not be affected. By symmetry, when $H_1AH_1$ is computed to complete 
the transformation, the zeros in the $n$th column will not be changed, but zeros will 
appear in the first $n - 2$ columns of the $n$th row. Each subsequent Householder 
matrix, $H_i$ ($i = 2, 3, 4, \ldots, n - 2$), is then selected so that 

$$H_i H_{i-1} \cdots H_2 H_1 A H_1 H_2 \cdots H_{i-1}$$

will have zeros in the first $n - i - 1$ rows of the $(n - i + 1)$st column but will not 
affect the bottom $i$ rows. Completing the $i$th transformation will place zeros in the 
first $n - i - 1$ columns of the $(n - i + 1)$st row. 
Determining the appropriate Householder matrix for use in each step of the 
above algorithm requires the solution of the following fundamental problem: 

Given an integer $k$ and an $n$-dimensional column vector $x$, select $w$ so 
that $Hx = (I - 2ww^T)x$ has zeros in the first $n - k - 1$ rows but leaves 
the last $k$ elements in $x$ unchanged. 

Note that this problem specification contains only $n - 1$ conditions on the vector $w$. 
The last condition comes from the requirement that $w^Tw = 1$, or, equivalently, that 
the vector $(I - 2ww^T)x$ have the same Euclidean norm as the vector $x$. 

To solve this problem, first note that in order for the last $k$ elements in $x$ to 
be unchanged, the last $k$ elements in $w$ must be zero. This guarantees that the last 
$k$ rows and columns of $H$ are identical to the identity matrix. Thus $w$ must be of 
the form 

$$w = \begin{bmatrix} w_1 & w_2 & w_3 & \cdots & w_{n-k} & 0 & \cdots & 0 \end{bmatrix}^T.$$

Let $b = (I - 2ww^T)x$, where by construction $b$ will have the form 

$$b = \begin{bmatrix} 0 & \cdots & 0 & \alpha & x_{n-k+1} & \cdots & x_n \end{bmatrix}^T$$

with $n - k - 1$ zeros at the beginning of the vector. Since multiplication by the 
Householder matrix must preserve the Euclidean norm, we must have $b^Tb = x^Tx$, 
which implies that 

$$\alpha^2 = x_1^2 + x_2^2 + x_3^2 + \cdots + x_{n-k}^2.$$

To proceed further, let’s rearrange the equation defining the vector $b$ as 

$$x - 2ww^Tx = b. \qquad (1)$$

Premultiplying equation (1) by $w^T$ yields 

$$w^Tx - 2w^Tww^Tx = w^Tb,$$

which simplifies to 

$$-w^Tx = \alpha w_{n-k} \qquad (2)$$

upon taking into account the form of both $w$ and $b$ and using the fact that 
$w^Tw = 1$. Substituting equation (2) into equation (1) produces 

$$x + 2\alpha w_{n-k} w = b,$$

or, in component form, 

$$\begin{aligned}
x_i + 2\alpha w_{n-k} w_i &= 0 \qquad (i = 1, 2, 3, \ldots, n - k - 1) \\
x_{n-k} + 2\alpha w_{n-k}^2 &= \alpha.
\end{aligned}$$

From the last of these equations we see that 

$$w_{n-k}^2 = \frac{1}{2}\left( 1 - \frac{x_{n-k}}{\alpha} \right).$$

To avoid cancellation error, we will choose $\mathrm{sgn}(\alpha) = -\mathrm{sgn}(x_{n-k})$. With $w_{n-k}$ 
determined, the remaining nonzero entries in $w$ are given by 

$$w_i = -\frac{1}{2} \frac{x_i}{\alpha w_{n-k}} \qquad (i = 1, 2, 3, \ldots, n - k - 1).$$
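The formulas for $\alpha$, $w_{n-k}$ and the remaining $w_i$ translate almost line for line into code. The sketch below is one possible rendering, with zero-based list indices, a hypothetical function name `householder_vector`, and the assumption that the first $n - k$ entries of $x$ are not all zero (otherwise $\alpha = 0$ and the formulas break down). It is tried on the vector $x = [2\ {-2}\ 1\ 4]^T$ with $k = 1$.

```python
import math

def householder_vector(x, k):
    """Return (w, alpha) such that (I - 2 w w^T) x has zeros in its first
    n - k - 1 entries, alpha in entry n - k, and its last k entries unchanged.
    Assumes the first n - k entries of x are not all zero."""
    n = len(x)
    m = n - k  # x_1, ..., x_m are the entries that take part
    alpha = math.sqrt(sum(c * c for c in x[:m]))
    if x[m - 1] > 0:
        alpha = -alpha  # sgn(alpha) = -sgn(x_{n-k}) avoids cancellation
    w = [0.0] * n
    w[m - 1] = math.sqrt(0.5 * (1.0 - x[m - 1] / alpha))
    for i in range(m - 1):
        w[i] = -0.5 * x[i] / (alpha * w[m - 1])
    return w, alpha

x = [2.0, -2.0, 1.0, 4.0]
w, alpha = householder_vector(x, 1)

# Apply H to x without forming H: Hx = x - 2 w (w^T x)
wTx = sum(a * b for a, b in zip(w, x))
Hx = [xi - 2.0 * wi * wTx for xi, wi in zip(x, w)]

print(alpha, w)
print(Hx)
```

For this input the computed $w$ agrees with $(\sqrt{6}/6)[1\ {-1}\ 2\ 0]^T$ up to roundoff, and $Hx$ has the form $[0\ 0\ \alpha\ 4]^T$ with $\alpha = -3$.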


EXAMPLE 4.8 Reduction to Tridiagonal Form 


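Consider the symmetric 4 x 4 matrix 

$$A = \begin{bmatrix} -1 & -2 & 1 & 2 \\ -2 & 3 & 0 & -2 \\ 1 & 0 & 2 & 1 \\ 2 & -2 & 1 & 4 \end{bmatrix}.$$

For the first step of the reduction to tridiagonal form, we want to produce zeros in 
the first two rows of the last column of $A$ and leave the last element in that column 
alone. Therefore, we are working with $k = 1$ and the vector $x = \begin{bmatrix} 2 & -2 & 1 & 4 \end{bmatrix}^T$. 
With this vector, we compute $\alpha^2 = 2^2 + (-2)^2 + 1^2 = 9$ and since $\mathrm{sgn}(x_3) = +1$, 
we choose $\alpha = -3$. It then follows that 

$$w_3^2 = \frac{1}{2}\left( 1 - \frac{1}{-3} \right) = \frac{2}{3}, \qquad w_1 = -\frac{1}{2} \cdot \frac{2}{-3(\sqrt{6}/3)} = \frac{\sqrt{6}}{6}, \qquad \text{and} \qquad w_2 = -\frac{1}{2} \cdot \frac{-2}{-3(\sqrt{6}/3)} = -\frac{\sqrt{6}}{6}.$$

Hence, $w = (\sqrt{6}/6) \begin{bmatrix} 1 & -1 & 2 & 0 \end{bmatrix}^T$. Next, we compute 

$u = Aw = (\sqrt{6}/6) \begin{bmatrix} 3 & -5 & 5 & 6 \end{bmatrix}^T$; 
$K = w^Tu = 3$; and 
$q = u - Kw = (\sqrt{6}/6) \begin{bmatrix} 0 & -2 & -1 & 6 \end{bmatrix}^T$. 

Therefore, 

$$\begin{aligned}
H_1AH_1 &= A - \frac{1}{3} \begin{bmatrix} 1 \\ -1 \\ 2 \\ 0 \end{bmatrix} \begin{bmatrix} 0 & -2 & -1 & 6 \end{bmatrix} - \frac{1}{3} \begin{bmatrix} 0 \\ -2 \\ -1 \\ 6 \end{bmatrix} \begin{bmatrix} 1 & -1 & 2 & 0 \end{bmatrix} \\
&= \begin{bmatrix} -1 & -4/3 & 4/3 & 0 \\ -4/3 & 5/3 & 1 & 0 \\ 4/3 & 1 & 10/3 & -3 \\ 0 & 0 & -3 & 4 \end{bmatrix}.
\end{aligned}$$

For the second (and final) step of the reduction, we want to produce a zero in 
the first row of the third column of $H_1AH_1$ and leave the last two elements in 
that column alone. Therefore, we are working with $k = 2$ and the vector 
$x = \begin{bmatrix} 4/3 & 1 & 10/3 & -3 \end{bmatrix}^T$. With this vector, we compute $\alpha^2 = 25/9$ and since 
$\mathrm{sgn}(x_2) = +1$, we choose $\alpha = -5/3$. It then follows that 

$$w_2^2 = \frac{1}{2}\left( 1 - \frac{1}{-5/3} \right) = \frac{4}{5} \qquad \text{and} \qquad w_1 = -\frac{1}{2} \cdot \frac{4/3}{(-5/3)(2\sqrt{5}/5)} = \frac{\sqrt{5}}{5}.$$

Hence, $w = (\sqrt{5}/5) \begin{bmatrix} 1 & 2 & 0 & 0 \end{bmatrix}^T$. Next, working with the current matrix $H_1AH_1$ 
in place of $A$, we compute 

$u = (H_1AH_1)w = (\sqrt{5}/5) \begin{bmatrix} -11/3 & 2 & 10/3 & 0 \end{bmatrix}^T$; 
$K = w^Tu = 1/15$; and 
$q = u - Kw = (\sqrt{5}/5) \begin{bmatrix} -56/15 & 28/15 & 10/3 & 0 \end{bmatrix}^T$. 

Therefore, 

$$\begin{aligned}
H_2H_1AH_1H_2 &= H_1AH_1 - \frac{2}{5} \begin{bmatrix} 1 \\ 2 \\ 0 \\ 0 \end{bmatrix} \begin{bmatrix} -56/15 & 28/15 & 10/3 & 0 \end{bmatrix} - \frac{2}{5} \begin{bmatrix} -56/15 \\ 28/15 \\ 10/3 \\ 0 \end{bmatrix} \begin{bmatrix} 1 & 2 & 0 & 0 \end{bmatrix} \\
&= \begin{bmatrix} 149/75 & 68/75 & 0 & 0 \\ 68/75 & -33/25 & -5/3 & 0 \\ 0 & -5/3 & 10/3 & -3 \\ 0 & 0 & -3 & 4 \end{bmatrix}.
\end{aligned}$$

As a check, the trace of this matrix is $149/75 - 33/25 + 10/3 + 4 = 8$, which agrees 
with the trace of $A$, as it must under a similarity transformation. 


Obtaining the Eigenvectors of a Symmetric Matrix 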

Obtaining the Eigenvectors of a Symmetric Matrix 


While computing all of the eigenvalues of a symmetric matrix, it is possible to 
simultaneously compute the corresponding eigenvectors. The key observation that 
makes this possible is that when starting from the symmetric matrix $A$, the end 
result of all transformations (including those from both the first and second stage) 
will be the diagonal matrix $D$. Let the diagonal entries of $D$ be denoted, in order, 
by $\lambda_1, \lambda_2, \lambda_3, \ldots, \lambda_n$. Not only is each $\lambda_i$ an eigenvalue of $D$ (and also of $A$), but 
the eigenvector associated with $\lambda_i$ is $e_i$, the $i$th column of the identity matrix. In 
other words, $De_i = \lambda_i e_i$. But 

$$D = M^T H_{n-2} H_{n-3} \cdots H_1 A H_1 \cdots H_{n-3} H_{n-2} M,$$

where $H_i$ is the $i$th Householder matrix and $M$ accounts for all of the similarity 
transformations performed during stage two, the reduction from tridiagonal to 
diagonal form. Therefore, 

$$M^T H_{n-2} H_{n-3} \cdots H_1 A H_1 \cdots H_{n-3} H_{n-2} M e_i = \lambda_i e_i,$$

or 

$$A H_1 \cdots H_{n-3} H_{n-2} M e_i = \lambda_i H_1 \cdots H_{n-3} H_{n-2} M e_i.$$

Hence the eigenvector of the original matrix $A$ associated with the eigenvalue $\lambda_i$ is 

$$H_1 \cdots H_{n-3} H_{n-2} M e_i.$$

If we were now to group the eigenvectors of $A$ into a matrix, placing the eigenvector 
associated with the eigenvalue $\lambda_i$ in the $i$th column, the resulting matrix of 
eigenvectors would be 

$$H_1 \cdots H_{n-3} H_{n-2} M.$$

Based on this analysis, we see that to be in a position to determine the 
eigenvalues and eigenvectors of a symmetric matrix simultaneously, the product of 
all of the Householder matrices used to reduce the original matrix to tridiagonal 
form must be accumulated so that the computation of the eigenvectors can be 
continued during the reduction to diagonal form. To accomplish this task, let $V$ 
denote the matrix of eigenvectors of $A$, and initialize $V$ to be the identity matrix. 
As each step of the reduction of $A$ to tridiagonal form is carried out, postmultiply 
the current $V$ matrix by the corresponding Householder matrix; i.e., replace $V$ by 
$VH_i$ for each $i = 1, 2, 3, \ldots, n - 2$. After the final similarity transformation has been 
applied, $V$ will be equal to the product $H_1 H_2 H_3 \cdots H_{n-3} H_{n-2}$ and will contain all 
of the information needed to complete the calculation of the eigenvectors. 


EXAMPLE 4.9 Preparing to Find the Eigenvectors from the Previous 
Example 


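Suppose that we wish to obtain all of the eigenvectors of the matrix 

$$A = \begin{bmatrix} -1 & -2 & 1 & 2 \\ -2 & 3 & 0 & -2 \\ 1 & 0 & 2 & 1 \\ 2 & -2 & 1 & 4 \end{bmatrix}$$

in addition to all of the eigenvalues. In the previous example, we found that $A$ was 
similar to the tridiagonal matrix 

$$\begin{bmatrix} 149/75 & 68/75 & 0 & 0 \\ 68/75 & -33/25 & -5/3 & 0 \\ 0 & -5/3 & 10/3 & -3 \\ 0 & 0 & -3 & 4 \end{bmatrix}.$$

We arrived at this tridiagonal matrix by applying two similarity transformations 
to $A$. The first was based on the Householder matrix $H_1 = I - 2w_1w_1^T$, where 
$w_1 = (\sqrt{6}/6) \begin{bmatrix} 1 & -1 & 2 & 0 \end{bmatrix}^T$, and the second was based on the Householder 
matrix $H_2 = I - 2w_2w_2^T$, where $w_2 = (\sqrt{5}/5) \begin{bmatrix} 1 & 2 & 0 & 0 \end{bmatrix}^T$. 

To be in position to compute the eigenvectors when we reduce the above 
tridiagonal matrix to diagonal form, we must compute the matrix $V = H_1H_2$. The 
first step is to initialize the matrix $V$ to the 4 x 4 identity matrix. Following the 
first similarity transformation, we replace $V$ by 

$$V = I - \frac{1}{3} \begin{bmatrix} 1 \\ -1 \\ 2 \\ 0 \end{bmatrix} \begin{bmatrix} 1 & -1 & 2 & 0 \end{bmatrix} = \begin{bmatrix} 2/3 & 1/3 & -2/3 & 0 \\ 1/3 & 2/3 & 2/3 & 0 \\ -2/3 & 2/3 & -1/3 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix},$$

and then, following the second similarity transformation, we replace the current $V$ 
by 

$$\begin{aligned}
V &= \begin{bmatrix} 2/3 & 1/3 & -2/3 & 0 \\ 1/3 & 2/3 & 2/3 & 0 \\ -2/3 & 2/3 & -1/3 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} - \frac{2}{5} \begin{bmatrix} 2/3 & 1/3 & -2/3 & 0 \\ 1/3 & 2/3 & 2/3 & 0 \\ -2/3 & 2/3 & -1/3 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} 1 \\ 2 \\ 0 \\ 0 \end{bmatrix} \begin{bmatrix} 1 & 2 & 0 & 0 \end{bmatrix} \\
&= \begin{bmatrix} 2/15 & -11/15 & -2/3 & 0 \\ -1/3 & -2/3 & 2/3 & 0 \\ -14/15 & 2/15 & -1/3 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}.
\end{aligned}$$

This matrix contains all of the information needed to complete the calculation of 
the eigenvectors as the above tridiagonal matrix is transformed into a diagonal 
matrix. 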
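The two examples above can be reproduced with a short routine that carries out the full reduction and accumulates $V$ along the way. The sketch below is one way to organize the computation under stated assumptions: floating-point arithmetic, zero-based indices, illustrative function names, and a step that is simply skipped when the entries to be eliminated are already exactly zero. Each step uses the update $HAH = A - 2wq^T - 2qw^T$ and the analogous update $VH = V - 2(Vw)w^T$, so no Householder matrix is ever formed.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(M, v):
    return [dot(row, v) for row in M]

def householder_vector(x, k):
    """w for which (I - 2ww^T)x has zeros in its first n-k-1 entries."""
    m = len(x) - k
    alpha = math.sqrt(dot(x[:m], x[:m]))
    if x[m - 1] > 0:
        alpha = -alpha  # sgn(alpha) = -sgn(x_{n-k})
    w = [0.0] * len(x)
    w[m - 1] = math.sqrt(0.5 * (1.0 - x[m - 1] / alpha))
    for i in range(m - 1):
        w[i] = -0.5 * x[i] / (alpha * w[m - 1])
    return w

def tridiagonalize(A):
    """Return (T, V) with T = V^T A V tridiagonal and V = H_1 H_2 ... H_{n-2}."""
    n = len(A)
    T = [[float(a) for a in row] for row in A]
    V = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for k in range(1, n - 1):                 # steps k = 1, ..., n-2
        col = n - k                           # zero-based index of column n-k+1
        x = [T[i][col] for i in range(n)]
        if not any(x[:n - k - 1]):            # nothing to eliminate in this column
            continue
        w = householder_vector(x, k)
        u = matvec(T, w)
        K = dot(w, u)
        q = [ui - K * wi for ui, wi in zip(u, w)]
        T = [[T[i][j] - 2 * w[i] * q[j] - 2 * q[i] * w[j] for j in range(n)]
             for i in range(n)]
        Vw = matvec(V, w)                     # V H = V - 2 (V w) w^T
        V = [[V[i][j] - 2 * Vw[i] * w[j] for j in range(n)] for i in range(n)]
    return T, V

A = [[-1, -2, 1, 2], [-2, 3, 0, -2], [1, 0, 2, 1], [2, -2, 1, 4]]
T, V = tridiagonalize(A)
for row in T:
    print([round(c, 4) for c in row])
for row in V:
    print([round(c, 4) for c in row])
```

Entries that are zero in exact arithmetic may appear as values on the order of machine precision in the floating-point result.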
EXERCISES 


1. Let $Q$ be an $n \times n$ orthogonal matrix. Show that 
(a) $(Qx)^T Qy = x^Ty$ for all $n$-vectors $x$ and $y$. 
(b) if $\lambda$ is an eigenvalue of $Q$, then $|\lambda| = 1$. 
(c) $\rho(Q) = 1$. 
(d) $\|Q\|_2 = 1$. 
(e) $\kappa_2(Q) = 1$. 

2. Let $H$ be a Householder matrix. Show that 
(a) $H$ is symmetric. 
(b) $H^TH = HH^T = I$. 

3. (a) Let $Q_1$ and $Q_2$ be orthogonal matrices. Show that the matrices $Q_1Q_2$ and 
$Q_2Q_1$ are orthogonal. 
(b) Let $H_1$ and $H_2$ be Householder matrices. Is the matrix $H_1H_2$ necessarily a 
Householder matrix? 

4. Let $w_1$ and $w_2$ be vectors satisfying $\|w_1\|_2 = \|w_2\|_2 = 1$ and $w_1^Tw_2 = 0$, and 
let $H_1$ and $H_2$ denote the Householder matrices associated with $w_1$ and $w_2$, 
respectively. 
(a) Show that $H_1H_2$ is a Householder matrix. What is the corresponding $w$ 
vector? 
(b) Show that $H_2H_1$ is a Householder matrix. What is the corresponding $w$ 
vector? 

5. For each vector $w$, construct the corresponding Householder matrix $H$. Show 
that $H$ is symmetric and orthogonal. 


(a)w=[3 3 3]? 
(byw=[3 4 -} 3)" 
() w= [2 5 1]* 


(d)w=se[4 -1 2 -l 


6. For each vector $x$ and integer $i$, determine the Householder matrix $H$ so that 
$Hx$ has zeros in the first $n - i - 1$ entries. 


(a)x=[4 32 5]? i=l 
(b)x=[3 2 202 -6]° i=2 
(c)x=[12 -1 9 5 2]? G=1 


(d)x=[20 28 12 18 12 7]? i=2 
(e) x=[1 -1 2 3 -1 2]? i=3 


For Exercises 7-13, reduce the given symmetric matrix to symmetric tridiagonal form 
and compute the matrix V which has the information needed to complete the calculation 
of the eigenvectors. 


7. 


5.5 -2.5 -2.5 —1.5 
—2.5 5.5 1.5 2.5 
2.5 1.5 0.0 2.5 
-15 2.5 2.5 9.90 


fos 
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a ae sar 
See ee 
BS a5: a SG 
tat. or 38 
ae a a ee 
ef. “Bs ahi a. 0 
Oeil A Sb “7 
0 1 29 0 
a 0 1 0 -—2 
a a ae 
ae ee 
(OAS oh seth i eas 40 
0 2 -- 9 1 
1-10 1 2 
a4 0.25 0 0 0 0 
0.25 0.25 0 1 0.25 0 
0 000 1 0 
BOA) Og ie Sicpe ae ow 
0 0.25 1 0 0.25 0.25 
0 0 00 025 1 
ae ae ee ee 
a ee ae ae 


12, A=] -1 -2 20 -5 -4 


13. A= 


Exercises 14-16, consider the Lanczos method for transforming a symmetric matrix to 
symmetric tridiagonal form. Let A be an n x n symmetric matrix and suppose that Q 
is an orthogonal matrix for which Q? AQ = T, where T is the tridiagonal matrix 


bn-1 @n-1 bn 
bn Gn 


Further, let q; denote the zth column of @ and take b1 = bn41 =0 and qo = 0. 


14. (a) Show that $q_i^T q_i = 1$ for each $i = 1, 2, 3, \ldots, n$, but $q_i^T q_j = 0$ whenever $i \ne j$.

(b) Show that $A q_i = b_i q_{i-1} + a_i q_i + b_{i+1} q_{i+1}$ for each $i = 1, 2, 3, \ldots, n$. (Hint: $Q^T A Q = T$ is equivalent to $AQ = QT$.)

(c) Show that $a_i = q_i^T A q_i$.

(d) Show that $b_{i+1} = \pm \|A q_i - b_i q_{i-1} - a_i q_i\|_2$.

15. Use Exercise 14 to construct an algorithm to determine the elements of the tridiagonal matrix $T$ and the columns of the matrix $Q$ given an arbitrary vector $q_1$ with $\|q_1\|_2 = 1$.

16. Use the algorithm of Exercise 15 to transform the symmetric matrices of Exercises 7-13 to symmetric tridiagonal form.


In Exercises 17-19, we will examine a deflation technique based on Householder matrices 
and similarity transformations. 


17. Suppose $A$ is an $n \times n$ matrix with eigenvalue $\lambda_1$ and associated eigenvector $v_1$.

(a) Construct a Householder matrix $H$ such that $H v_1 = \alpha e_1$, where $\alpha$ is a nonzero constant and $e_1$ is the first column of the identity matrix.

(b) Let $B = H^T A H$, where $H$ is the Householder matrix constructed in part (a). Show that $B e_1 = \lambda_1 e_1$. From this we may conclude that

\[
B = \begin{bmatrix}
\lambda_1 & \times & \cdots & \times \\
0 & & & \\
\vdots & & \hat{B} & \\
0 & & &
\end{bmatrix},
\]

where $\hat{B}$ is an $(n-1) \times (n-1)$ matrix and $\times$ denotes an element that is not necessarily zero.

(c) How are the eigenvalues of $\hat{B}$ related to those of $A$?

(d) How are the eigenvectors of $\hat{B}$ related to those of $A$?

18. Use Exercise 17 to construct an algorithm to determine the two dominant eigenvalues and associated eigenvectors for a matrix $A$.

19. Use the algorithm of Exercise 18 to determine the two dominant eigenvalues and associated eigenvectors for the specified matrix.
[ -~10l 51 -12 0 0 
-174 88 ~-20 0 0 
(a) 136 ~68 19 0.6 (0 
840 -—420 105 -32 18 
2106 -1008 252 —-84 46 


Pe 62 l 0 1 


(b) | 0 -1 0 +1 0 


P10 <4 0 -<49 
a 35H" 14 
() | 9 0 2 0 
a ae oe 
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SO. 6 45. «tbr ees 
6 16 O -13 4 -9 ~2 
fe ee ee: a 
(Gy | ie E18 Be BOS Te a ek 
5 4 8 1 42 3 5 
Bo he oS a 
BoP 6g S38 esl ae 
1a yor 3 
1.3 0. 1 9 
(e) |} 1 0 -6 2 1 
0 2 a & co 
BO. A ue 2 


1.00 0.01 0.97 0.44 0.02 
0.01 1.00 0.15 0.69 0.86 
(f) | 0.97 0.15 1.00 0.51 0.12 
0.44 0.69 0.51 1.00 0.78 
0.02 0.86 0.12 0.78 1.00 


4.5 EIGENVALUES OF SYMMETRIC TRIDIAGONAL MATRICES 


We will now complete our study of the matrix eigenvalue problem by developing an algorithm to compute all of the eigenvalues of a symmetric tridiagonal matrix. The technique which we will develop in this section is called the QR algorithm. Unlike the technique developed in the previous section, the QR algorithm is iterative in nature. The sequence of matrices generated by the algorithm converges to a diagonal matrix, so the eigenvalues of the "final" matrix in the sequence are just the elements along the main diagonal. As we will continue to use similarity transformations, these diagonal elements are also the eigenvalues of the original matrix.


The Very Basics of the QR Algorithm 


We will start with a basic description of the QR algorithm and gradually develop the details. Let $A = A^{(0)}$ be a given matrix. The QR algorithm constructs the sequence of matrices $\{A^{(i)}\}$ as follows: for $i = 0, 1, 2, \ldots$,

• factor $A^{(i)}$ into the product $Q^{(i)} R^{(i)}$, where $Q^{(i)}$ is an orthogonal matrix (i.e., $[Q^{(i)}]^{-1} = [Q^{(i)}]^T$) and $R^{(i)}$ is an upper triangular matrix; and
• compute $A^{(i+1)} = R^{(i)} Q^{(i)}$.

From the relation $A^{(i)} = Q^{(i)} R^{(i)}$, it follows that $[Q^{(i)}]^T A^{(i)} = R^{(i)}$, since $Q^{(i)}$ is an orthogonal matrix. The calculation in the second step is then equivalent to $A^{(i+1)} = R^{(i)} Q^{(i)} = [Q^{(i)}]^T A^{(i)} Q^{(i)}$. Hence, each iteration performs a similarity transformation with an orthogonal matrix, which implies that the eigenvalues of $A^{(i+1)}$ are identical to those of $A^{(i)}$.

As just described, the QR algorithm can be applied to any matrix. We will, however, discuss the implementation of the QR algorithm for symmetric tridiagonal matrices only. For details of the algorithm applied to more general matrices, consult Wilkinson [1], Golub and van Loan [2], or Press et al. [3].
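
The two-step iteration just described can be sketched in a few lines. This is only an illustration under stated assumptions: it relies on NumPy's general-purpose `numpy.linalg.qr` rather than the rotation-based factorization developed later in this section, and the function name `basic_qr_iteration` and the test matrix are choices made here for illustration.

```python
import numpy as np

def basic_qr_iteration(A, num_iters):
    """Unshifted QR iteration: factor A = QR, then replace A by RQ."""
    Ak = np.array(A, dtype=float)
    for _ in range(num_iters):
        Q, R = np.linalg.qr(Ak)  # Ak = Q R with Q orthogonal, R upper triangular
        Ak = R @ Q               # R Q = Q^T Ak Q, so the eigenvalues are unchanged
    return Ak

# A small symmetric tridiagonal test matrix (an illustrative choice).
A0 = [[4.0, 3.0, 0.0],
      [3.0, 1.0, -1.0],
      [0.0, -1.0, 3.0]]
A_final = basic_qr_iteration(A0, 60)
print(np.diag(A_final))          # diagonal entries approximate the eigenvalues
```

Because every step is an orthogonal similarity transformation, the diagonal of the final iterate can be compared directly with the eigenvalues of the starting matrix.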

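What is the effect of performing the iterations of the QR algorithm? Consider the symmetric tridiagonal matrix

\[
A^{(0)} = \begin{bmatrix} 4 & 3 & 0 \\ 3 & 1 & -1 \\ 0 & -1 & 3 \end{bmatrix}.
\]

A portion of the sequence $\{A^{(i)}\}$ is

\[
A^{(2)} = \begin{bmatrix} 5.928 & -0.276 & 0 \\ -0.276 & 2.227 & -1.692 \\ 0 & -1.692 & -0.155 \end{bmatrix}, \qquad
A^{(4)} = \begin{bmatrix} 5.950 & -0.0664 & 0 \\ -0.0664 & 3.071 & -0.241 \\ 0 & -0.241 & -1.021 \end{bmatrix},
\]

\[
A^{(6)} = \begin{bmatrix} 5.951 & -0.0178 & 0 \\ -0.0178 & 3.084 & -0.0272 \\ 0 & -0.0272 & -1.035 \end{bmatrix}, \qquad
A^{(8)} = \begin{bmatrix} 5.951 & 0.00478 & 0 \\ 0.00478 & 3.084 & -0.00306 \\ 0 & -0.00306 & -1.035 \end{bmatrix},
\]

\[
A^{(10)} = \begin{bmatrix} 5.951 & -0.00128 & 0 \\ -0.00128 & 3.084 & -0.000345 \\ 0 & -0.000345 & -1.035 \end{bmatrix}.
\]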

Observe that the off-diagonal elements are converging toward zero, while the diagonal elements are converging toward the eigenvalues of $A^{(0)}$, which, to three decimal places, are 5.951, 3.084 and $-1.035$. Further, the eigenvalues appear along the diagonal of $A^{(i)}$ in decreasing order of magnitude. Francis [4] has shown that this example demonstrates the general performance of the QR algorithm and that the off-diagonal entry in row $j$, column $j-1$ of $A^{(i+1)}$ converges toward zero with rate of convergence $O(|\lambda_j/\lambda_{j-1}|)$.

Now let's start filling in some of the details.


QR Factorization 


The heart of the QR algorithm is the calculation of the QR factorization of the matrix $A^{(i)}$. Given the simple structure of the matrices with which we are going to work (symmetric tridiagonal), we will perform the QR factorization using what are known as rotation matrices.


Definition. Let $i < j$. The orthogonal matrix, $P_{(i,j)}$, which is identical to the identity matrix with the exception that

\[
p_{i,i} = p_{j,j} = \cos\theta \quad \text{and} \quad p_{i,j} = -p_{j,i} = \sin\theta,
\]

for some angle $\theta$, is called a ROTATION MATRIX.


The name rotation matrix arises from the geometric fact that $P_{(i,j)}$ represents the rotation of the $i$th and $j$th axes about the origin of the coordinate system by an angle of $\theta$. For later use, it is important to note that premultiplication of an arbitrary matrix, $M$, by $P_{(i,j)}$ affects only the $i$th and $j$th rows. In particular,

\[
\text{$i$th row of } P_{(i,j)} M = \cos\theta \cdot (\text{$i$th row of } M) + \sin\theta \cdot (\text{$j$th row of } M)
\]

and

\[
\text{$j$th row of } P_{(i,j)} M = -\sin\theta \cdot (\text{$i$th row of } M) + \cos\theta \cdot (\text{$j$th row of } M).
\]

The factorization of the symmetric tridiagonal matrix $A^{(i)}$ now proceeds in exactly the same manner as the matrix factorization algorithms we developed in Chapter 3. For an $n \times n$ matrix, we make $n-1$ passes through the matrix, with each pass "zeroing" out a specific element below the main diagonal. Thus, in the first pass, $P_{(1,2)}$ is chosen so that $P_{(1,2)} A^{(i)}$ has a zero in row 2, column 1. Next, $P_{(2,3)}$ is chosen so that $P_{(2,3)} P_{(1,2)} A^{(i)}$ has a zero in the third row of the second column, $P_{(3,4)}$ is chosen so that $P_{(3,4)} P_{(2,3)} P_{(1,2)} A^{(i)}$ has a zero in the fourth row of the third column, and so on. Finally, $P_{(n-1,n)}$ is chosen so that $P_{(n-1,n)} \cdots P_{(3,4)} P_{(2,3)} P_{(1,2)} A^{(i)}$ is an upper triangular matrix. Hence, $R^{(i)} = P_{(n-1,n)} \cdots P_{(3,4)} P_{(2,3)} P_{(1,2)} A^{(i)}$.

To examine the details of this factorization scheme more closely, let

\[
A^{(i)} = \begin{bmatrix}
a_1 & b_1 & & & & \\
b_1 & a_2 & b_2 & & & \\
 & b_2 & a_3 & b_3 & & \\
 & & \ddots & \ddots & \ddots & \\
 & & & b_{n-2} & a_{n-1} & b_{n-1} \\
 & & & & b_{n-1} & a_n
\end{bmatrix}.
\]

For notational convenience, let the cosine and sine values associated with the rotation matrix $P_{(j,j+1)}$ be denoted by $c_j$ and $s_j$, respectively. Carrying out the multiplication $P_{(1,2)} A^{(i)}$, we find

\[
\begin{bmatrix}
c_1 & s_1 & & \\
-s_1 & c_1 & & \\
 & & 1 & \\
 & & & \ddots
\end{bmatrix}
\begin{bmatrix}
a_1 & b_1 & & & \\
b_1 & a_2 & b_2 & & \\
 & b_2 & a_3 & b_3 & \\
 & & \ddots & \ddots & \ddots
\end{bmatrix}
=
\begin{bmatrix}
a_1 c_1 + b_1 s_1 & b_1 c_1 + a_2 s_1 & b_2 s_1 & & \\
-a_1 s_1 + b_1 c_1 & -b_1 s_1 + a_2 c_1 & b_2 c_1 & & \\
 & b_2 & a_3 & b_3 & \\
 & & \ddots & \ddots & \ddots
\end{bmatrix}.
\tag{1}
\]

We now want to choose $c_1$ and $s_1$ so that $-a_1 s_1 + b_1 c_1 = 0$. One solution of this equation, which also satisfies the fundamental trig identity $c_1^2 + s_1^2 = 1$, is

\[
c_1 = \frac{a_1}{\sqrt{a_1^2 + b_1^2}} \quad \text{and} \quad s_1 = \frac{b_1}{\sqrt{a_1^2 + b_1^2}}.
\]

With $c_1$ and $s_1$ selected so that $-a_1 s_1 + b_1 c_1 = 0$, note that $a_1$ appears on the right-hand side of (1) in the first row, first column only, which is the precise location of $a_1$ in the matrix $A^{(i)}$. We may therefore overwrite $a_1$ with the expression

\[
a_1 c_1 + b_1 s_1 = a_1 \frac{a_1}{\sqrt{a_1^2 + b_1^2}} + b_1 \frac{b_1}{\sqrt{a_1^2 + b_1^2}} = \sqrt{a_1^2 + b_1^2}.
\]

In a similar manner, we would like to save the first two elements in the second column of $P_{(1,2)} A^{(i)}$ in place of $b_1$ and $a_2$; unfortunately, to calculate these elements, both $b_1$ and $a_2$ are required. However, if we save the current value of $b_1$ in a temporary variable, say $t$, we may then overwrite $b_1$ with the expression $t c_1 + a_2 s_1$ and $a_2$ with $-t s_1 + a_2 c_1$. Finally, we save the value of $b_2$ in the variable $t$ and then overwrite $b_2$ with the quantity $b_2 c_1 = t c_1$. We need to save $b_2$ in order to calculate the sine and cosine values associated with the next rotation matrix.

It turns out that the element in the third column of the first row of $P_{(1,2)} A^{(i)}$, $b_2 s_1$, does not need to be saved. The remaining passes in the factorization step do not involve the first row, so the indicated element will not be needed for any later calculations. Further, as we will see shortly, the calculation of the product $R^{(i)} Q^{(i)}$ can be carried out without knowing this element. Technically, by not saving the value $b_2 s_1$, we are not obtaining the true QR factorization of the matrix $A^{(i)}$. We are, however, maintaining all the information we will need to calculate $R^{(i)} Q^{(i)} = A^{(i+1)}$, which, in our present circumstances, is the real objective.

The calculations required by all subsequent passes in the factorization step are identical to those indicated for the first pass, with two exceptions. First, we of course need to increment the subscripts for each new pass. Second, for the $j$th pass, with $j = 2, 3, 4, \ldots, n-1$, $c_j$ and $s_j$ are given by

\[
c_j = \frac{a_j}{\sqrt{a_j^2 + t^2}} \quad \text{and} \quad s_j = \frac{t}{\sqrt{a_j^2 + t^2}},
\]

since we've used $t$ to save the old value of $b_j$. We can therefore implement the entire factorization process as follows.


    save b_1 in the temporary variable t
    for j = 1, 2, 3, ..., n-1
        let r = sqrt(a_j^2 + t^2)
        compute c_j = a_j / r and s_j = t / r
        overwrite a_j with r
        save b_j in t
        overwrite b_j with t c_j + a_{j+1} s_j
        overwrite a_{j+1} with -t s_j + a_{j+1} c_j
        if (j ≠ n-1)
            save b_{j+1} in t
            overwrite b_{j+1} with t c_j
        end
    end


The first line in this pseudocode has been included so that the first pass can be handled in the same manner as all of the later passes. The final two statements have been placed inside a conditional statement since, during the last pass through the matrix, there is no element $b_{j+1} = b_n$ to overwrite.
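
The pseudocode above translates almost line for line into a runnable sketch. The function name `factor_tridiagonal`, the 0-based list storage of the diagonal `a` and off-diagonal `b`, and the choice to return copies instead of overwriting the inputs are assumptions made here for illustration. The sketch also assumes $n \ge 2$ and that no pass produces $r = 0$.

```python
import math

def factor_tridiagonal(a, b):
    """Rotation-based factorization of a symmetric tridiagonal matrix.

    a: diagonal entries (length n); b: off-diagonal entries (length n-1).
    Returns the overwritten a and b together with the cosines c and sines s.
    """
    a = [float(x) for x in a]
    b = [float(x) for x in b]
    n = len(a)
    c = [0.0] * (n - 1)
    s = [0.0] * (n - 1)
    t = b[0]                              # save b_1 in t
    for j in range(n - 1):
        r = math.hypot(a[j], t)           # r = sqrt(a_j^2 + t^2), assumed nonzero
        c[j], s[j] = a[j] / r, t / r
        a[j] = r
        t = b[j]
        b[j] = t * c[j] + a[j + 1] * s[j]
        a[j + 1] = -t * s[j] + a[j + 1] * c[j]
        if j != n - 2:                    # no b_{j+1} exists on the last pass
            t = b[j + 1]
            b[j + 1] = t * c[j]
    return a, b, c, s

# Data of the 3-by-3 example that follows: a = (4, 1, 3), b = (3, -1).
a_f, b_f, c_f, s_f = factor_tridiagonal([4, 1, 3], [3, -1])
```

For this input the first rotation has $c_1 = 4/5$ and $s_1 = 3/5$, in agreement with the hand calculation in the example.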


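EXAMPLE 4.10 The QR Factorization of a Symmetric Tridiagonal Matrix


Consider again the symmetric tridiagonal matrix

\[
A = \begin{bmatrix} 4 & 3 & 0 \\ 3 & 1 & -1 \\ 0 & -1 & 3 \end{bmatrix}.
\]

For this example, we have

\[
a_1 = 4, \quad a_2 = 1, \quad a_3 = 3, \quad b_1 = 3, \quad \text{and} \quad b_2 = -1.
\]

To prepare for the first pass, we set $t = b_1 = 3$. We then calculate

\[
r = \sqrt{a_1^2 + t^2} = 5, \quad c_1 = \frac{a_1}{r} = \frac{4}{5}, \quad \text{and} \quad s_1 = \frac{t}{r} = \frac{3}{5},
\]

and set $a_1 = r = 5$. Next, we set $t = b_1 = 3$ and then calculate

\[
b_1 = t c_1 + a_2 s_1 = 3 \quad \text{and} \quad a_2 = -t s_1 + a_2 c_1 = -1.
\]

Finally, set $t = b_2 = -1$ and calculate $b_2 = t c_1 = -\frac{4}{5}$.

The second pass starts with the calculations

\[
r = \sqrt{a_2^2 + t^2} = \sqrt{2}, \quad c_2 = \frac{a_2}{r} = -\frac{1}{\sqrt{2}}, \quad \text{and} \quad s_2 = \frac{t}{r} = -\frac{1}{\sqrt{2}}.
\]

After setting $a_2 = r = \sqrt{2}$ and $t = b_2 = -\frac{4}{5}$, we then calculate

\[
b_2 = t c_2 + a_3 s_2 = -\frac{11}{5\sqrt{2}} \quad \text{and} \quad a_3 = -t s_2 + a_3 c_2 = -\frac{19}{5\sqrt{2}}.
\]

The results of our factorization of $A$ are therefore

\[
a_1 = 5, \quad a_2 = \sqrt{2}, \quad a_3 = -\frac{19}{5\sqrt{2}}, \qquad
b_1 = 3, \quad b_2 = -\frac{11}{5\sqrt{2}},
\]

\[
c_1 = \frac{4}{5}, \quad s_1 = \frac{3}{5}, \qquad
c_2 = -\frac{1}{\sqrt{2}}, \quad s_2 = -\frac{1}{\sqrt{2}}.
\]

We will now examine how to use these values to compute the product $R^{(i)} Q^{(i)}$.

The Product $R^{(i)} Q^{(i)}$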

Earlier, we established that the upper triangular matrix in the QR factorization of the matrix $A^{(i)}$ is given by $R^{(i)} = P_{(n-1,n)} \cdots P_{(3,4)} P_{(2,3)} P_{(1,2)} A^{(i)}$. Combining this expression with the equation $[Q^{(i)}]^T A^{(i)} = R^{(i)}$, we see that $[Q^{(i)}]^T = P_{(n-1,n)} \cdots P_{(3,4)} P_{(2,3)} P_{(1,2)}$. This, in turn, implies that

\[
Q^{(i)} = P_{(1,2)}^T P_{(2,3)}^T P_{(3,4)}^T \cdots P_{(n-1,n)}^T.
\]

To form the product $R^{(i)} Q^{(i)}$, however, there is no need to compute the matrix $Q^{(i)}$ explicitly. Instead, we can save the $s_j$ and $c_j$ values associated with each rotation matrix, $P_{(j,j+1)}$, and then postmultiply $R^{(i)}$ by the transpose of each rotation matrix in succession. To carry out each multiplication we make use of the relations

\[
\text{$i$th column of } M P_{(i,j)}^T = \cos\theta \cdot (\text{$i$th column of } M) + \sin\theta \cdot (\text{$j$th column of } M) \tag{2}
\]

and

\[
\text{$j$th column of } M P_{(i,j)}^T = -\sin\theta \cdot (\text{$i$th column of } M) + \cos\theta \cdot (\text{$j$th column of } M). \tag{3}
\]

As with the factorization process, we can deduce the complete sequence of calculations for obtaining the product $R^{(i)} Q^{(i)}$ by examining just the first multiplication, $R^{(i)} P_{(1,2)}^T$. We find

\[
\begin{bmatrix}
a_1 & b_1 & e_1 & & \\
 & a_2 & b_2 & e_2 & \\
 & & a_3 & b_3 & e_3 \\
 & & & \ddots & \ddots
\end{bmatrix}
\begin{bmatrix}
c_1 & -s_1 & & \\
s_1 & c_1 & & \\
 & & 1 & \\
 & & & \ddots
\end{bmatrix}
=
\begin{bmatrix}
a_1 c_1 + b_1 s_1 & -a_1 s_1 + b_1 c_1 & e_1 & & \\
a_2 s_1 & a_2 c_1 & b_2 & e_2 & \\
 & & a_3 & b_3 & e_3 \\
 & & & \ddots & \ddots
\end{bmatrix}.
\tag{4}
\]

Here, the $e_j$ denote the values that we know are present in $R^{(i)}$ but that we did not save during the factorization step.

We now make two very important observations. First, based on equations (2) and (3), we know that postmultiplication by $P_{(2,3)}^T, P_{(3,4)}^T, \ldots, P_{(n-1,n)}^T$ will have no effect on the first column of $R^{(i)} P_{(1,2)}^T$. Therefore, the first column of $R^{(i)} P_{(1,2)}^T$, as shown on the right-hand side of (4), is the first column of $R^{(i)} Q^{(i)}$. Second, since $A^{(i)}$ is symmetric and each $Q^{(i)}$ is orthogonal, it follows that $A^{(i+1)} = R^{(i)} Q^{(i)}$ must also be symmetric (see Exercise 1). Consequently, not only do we know the first column of $R^{(i)} Q^{(i)}$ after this first multiplication, we know the first row as well. Thus, calculation of the values along the main diagonal and below must be carried out, but calculations above the main diagonal are unnecessary.

Bringing all this information together, it follows that to obtain the product $R^{(i)} Q^{(i)}$, we need to perform the operations

    overwrite a_j with a_j c_j + b_j s_j;
    overwrite b_j with a_{j+1} s_j; and
    overwrite a_{j+1} with a_{j+1} c_j,

for $j = 1, 2, 3, \ldots, n-1$. Note that the $e_j$ do not play a role in any of these computations.
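
The three overwrite operations can be sketched as a short loop. The function name `rq_product` and the list-based storage are assumptions made for illustration. The input values are the factorization results of the $3 \times 3$ example with $a = (4, 1, 3)$ and $b = (3, -1)$.

```python
import math

def rq_product(a, b, c, s):
    """Form the diagonal and off-diagonal of R Q from the saved rotation data."""
    a = list(a)
    b = list(b)
    for j in range(len(a) - 1):
        a[j] = a[j] * c[j] + b[j] * s[j]
        b[j] = a[j + 1] * s[j]
        a[j + 1] = a[j + 1] * c[j]
    return a, b

r2 = math.sqrt(2.0)
a_new, b_new = rq_product([5.0, r2, -19.0 / (5.0 * r2)],   # diagonal after factoring
                          [3.0, -11.0 / (5.0 * r2)],       # off-diagonal after factoring
                          [0.8, -1.0 / r2],                # cosines c_1, c_2
                          [0.6, -1.0 / r2])                # sines s_1, s_2
```

The returned lists hold the diagonal and the off-diagonal of the next iterate. Its trace equals that of the original matrix, as a similarity transformation requires.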


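EXAMPLE 4.11 The Product RQ from the Previous Example


The results of our factorization of

\[
A = \begin{bmatrix} 4 & 3 & 0 \\ 3 & 1 & -1 \\ 0 & -1 & 3 \end{bmatrix}
\]

were

\[
a_1 = 5, \quad a_2 = \sqrt{2}, \quad a_3 = -\frac{19}{5\sqrt{2}}, \qquad
b_1 = 3, \quad b_2 = -\frac{11}{5\sqrt{2}},
\]

\[
c_1 = \frac{4}{5}, \quad s_1 = \frac{3}{5}, \qquad
c_2 = -\frac{1}{\sqrt{2}}, \quad s_2 = -\frac{1}{\sqrt{2}}.
\]

The first set of calculations leading to the product $R^{(0)} Q^{(0)}$ yields

\[
a_1 = a_1 c_1 + b_1 s_1 = \frac{29}{5} = 5.8; \quad
b_1 = a_2 s_1 = \frac{3\sqrt{2}}{5}; \quad \text{and} \quad
a_2 = a_2 c_1 = \frac{4\sqrt{2}}{5}.
\]

The second set of calculations then gives

\[
a_2 = a_2 c_2 + b_2 s_2 = -\frac{4}{5} + \frac{11}{10} = 0.3; \quad
b_2 = a_3 s_2 = 1.9; \quad \text{and} \quad
a_3 = a_3 c_2 = 1.9.
\]

Hence,

\[
A^{(1)} = R^{(0)} Q^{(0)} = \begin{bmatrix} 5.8 & 0.848528 & 0 \\ 0.848528 & 0.3 & 1.9 \\ 0 & 1.9 & 1.9 \end{bmatrix}.
\]

Accelerating Convergence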


Since the rate of convergence of the off-diagonal elements to zero is $O(|\lambda_j/\lambda_{j-1}|)$, convergence of the sequence $\{A^{(i)}\}$ to diagonal form will be slow whenever the eigenvalues of $A$ are closely spaced in magnitude. To accelerate convergence, we can shift the eigenvalues of $A^{(i)}$ by subtracting a multiple, $\sigma_i$, of the identity matrix, much as is done in the inverse power method. This changes the first step of each iteration to

• factor $A^{(i)} - \sigma_i I$ into the product $Q^{(i)} R^{(i)}$.

Technically, the computation of $A^{(i+1)}$ should then be $A^{(i+1)} = R^{(i)} Q^{(i)} + \sigma_i I$ so that $A^{(i+1)}$ will be similar to $A^{(i)}$. The addition of $\sigma_i I$ is typically not done, however. Rather, the shifts, $\sigma_i$, from all of the iterations are accumulated, and the accumulated value is added to the diagonal entries once convergence has been obtained.

The two most common choices for $\sigma_i$ are

(1) $\sigma_i = a_n^{(i)}$, or

(2) $\sigma_i =$ the eigenvalue of $\begin{bmatrix} a_{n-1}^{(i)} & b_{n-1}^{(i)} \\ b_{n-1}^{(i)} & a_n^{(i)} \end{bmatrix}$ that is closest to $a_n^{(i)}$.

Here, and below, $a_j^{(i)}$ denotes the $j$th diagonal element and $b_j^{(i)}$ the $j$th off-diagonal element of $A^{(i)}$. The QR algorithm converges much faster with one of these shifts than with no shifting at all (see Wilkinson [1]). The second shift, often called the Wilkinson shift, is generally preferred and usually produces cubic convergence (see Wilkinson and Reinsch [5]).
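
The second choice of shift only requires the eigenvalues of a symmetric $2 \times 2$ matrix, which are available in closed form. The sketch below uses a hypothetical helper name, `wilkinson_shift`, introduced here for illustration. It uses the fact that the eigenvalues of $\begin{bmatrix} p & b \\ b & q \end{bmatrix}$ are $\frac{p+q}{2} \pm \sqrt{\left(\frac{p-q}{2}\right)^2 + b^2}$.

```python
import math

def wilkinson_shift(a_prev, a_last, b):
    """Eigenvalue of [[a_prev, b], [b, a_last]] that is closest to a_last."""
    mean = 0.5 * (a_prev + a_last)
    disc = math.hypot(0.5 * (a_prev - a_last), b)
    low, high = mean - disc, mean + disc
    return low if abs(low - a_last) < abs(high - a_last) else high

# Trailing 2-by-2 block [[1, -1], [-1, 3]]: eigenvalues are 2 - sqrt(2) and 2 + sqrt(2).
sigma = wilkinson_shift(1.0, 3.0, -1.0)
```

For this block the eigenvalue closer to 3 is $2 + \sqrt{2}$, so that is the shift returned.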
With either choice of the shift parameter, we are essentially trying to force $b_{n-1}^{(i)}$ to be the off-diagonal element which converges to zero fastest. Accordingly, after each of the initial iterations of the algorithm, we check the size of $|b_{n-1}^{(i+1)}|$, and when this value falls below a specified convergence tolerance, the value $a_n^{(i+1)} + \Sigma$ is accepted as an eigenvalue of $A^{(0)}$. Here, $\Sigma$ denotes the sum of all shifts that have been carried out, starting from the first iteration up to the current iteration. At this point, we no longer need to include the $n$th row or the $n$th column in the calculations. For subsequent iterations, the shift is therefore chosen as either $a_{n-1}^{(i)}$ or the eigenvalue of

\[
\begin{bmatrix} a_{n-2}^{(i)} & b_{n-2}^{(i)} \\ b_{n-2}^{(i)} & a_{n-1}^{(i)} \end{bmatrix}
\]

that is closest to $a_{n-1}^{(i)}$, and the size of $|b_{n-2}^{(i+1)}|$ is monitored. When this value has converged to zero, $a_{n-1}^{(i+1)} + \Sigma$ is accepted as another eigenvalue of $A^{(0)}$, and calculations proceed on the first $n-2$ rows and columns. This process,

    checking the magnitude of a specific off-diagonal element,
    accepting an eigenvalue upon convergence of the off-diagonal element, and
    proceeding with calculations on one fewer row and column

continues until $|b_1^{(i+1)}|$ converges to zero. At this point, $a_1^{(i+1)} + \Sigma$ and $a_2^{(i+1)} + \Sigma$ are the final two eigenvalues of $A^{(0)}$.

The following pseudocode summarizes the QR algorithm. In this code, the variable Σ accumulates the shifts, the variable last indicates the portion of the matrix to be included in calculations and the parameter TOL is the specified convergence tolerance. Note that n − last is the number of eigenvalues that have already been determined.


    initialize Σ = 0 and last = n
    for i = 1, 2, 3, ..., repeat until last = 1
        compute the shift, σ
        Σ = Σ + σ
        factor A^(i) − σI into Q^(i) R^(i)     (work only with rows 1, 2, 3, ..., last
        compute A^(i+1) = R^(i) Q^(i)           and columns 1, 2, 3, ..., last)
        if |b_{last−1}| < TOL
            report a_last + Σ as an eigenvalue
            last = last − 1
        end
    end
    report a_1 + Σ as an eigenvalue
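
Putting the pieces together, the pseudocode can be turned into a self-contained sketch. The function names (`wilkinson_shift`, `qr_step`, `qr_eigenvalues`), the list-based storage, the `max_iter` safeguard, and the guard against $r = 0$ are assumptions added here for illustration and are not part of the pseudocode above.

```python
import math

def wilkinson_shift(a_prev, a_last, b):
    """Eigenvalue of [[a_prev, b], [b, a_last]] that is closest to a_last."""
    mean = 0.5 * (a_prev + a_last)
    disc = math.hypot(0.5 * (a_prev - a_last), b)
    low, high = mean - disc, mean + disc
    return low if abs(low - a_last) < abs(high - a_last) else high

def qr_step(a, b, m):
    """Factor the leading m-by-m block and form R Q, overwriting a and b."""
    c = [0.0] * (m - 1)
    s = [0.0] * (m - 1)
    t = b[0]
    for j in range(m - 1):                     # factorization passes
        r = math.hypot(a[j], t)
        c[j], s[j] = (a[j] / r, t / r) if r > 0.0 else (1.0, 0.0)
        a[j] = r
        t = b[j]
        b[j] = t * c[j] + a[j + 1] * s[j]
        a[j + 1] = -t * s[j] + a[j + 1] * c[j]
        if j != m - 2:
            t = b[j + 1]
            b[j + 1] = t * c[j]
    for j in range(m - 1):                     # product R Q
        a[j] = a[j] * c[j] + b[j] * s[j]
        b[j] = a[j + 1] * s[j]
        a[j + 1] = a[j + 1] * c[j]

def qr_eigenvalues(diag, offdiag, tol=5e-4, max_iter=1000):
    """Eigenvalues of a symmetric tridiagonal matrix by shifted QR with deflation."""
    a = [float(x) for x in diag]
    b = [float(x) for x in offdiag]
    shift_sum, last, eigs, iters = 0.0, len(a), [], 0
    while last > 1:
        if iters >= max_iter:
            raise RuntimeError("QR iteration did not converge")
        iters += 1
        sigma = wilkinson_shift(a[last - 2], a[last - 1], b[last - 2])
        shift_sum += sigma                     # accumulate the shifts
        for k in range(last):
            a[k] -= sigma                      # form A - sigma*I on the active block
        qr_step(a, b, last)
        if abs(b[last - 2]) < tol:
            eigs.append(a[last - 1] + shift_sum)
            last -= 1
    eigs.append(a[0] + shift_sum)
    return eigs

eigs = qr_eigenvalues([4, 1, 3], [3, -1])
```

The eigenvalues are reported in the order in which they are accepted, which need not be sorted. Storing only the diagonal and off-diagonal keeps the work per iteration proportional to the size of the active block.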


The Algorithm in Action 
Now let's put the entire QR algorithm into action. Take

\[
A^{(0)} = \begin{bmatrix} 4 & 3 & 0 \\ 3 & 1 & -1 \\ 0 & -1 & 3 \end{bmatrix}.
\]

We will let TOL $= 5 \times 10^{-4}$ be the convergence tolerance, and we will implement the Wilkinson shift. Starting with last = 3, we compute the eigenvalues of

\[
\begin{bmatrix} a_2^{(0)} & b_2^{(0)} \\ b_2^{(0)} & a_3^{(0)} \end{bmatrix} = \begin{bmatrix} 1 & -1 \\ -1 & 3 \end{bmatrix},
\]

which are $2 \pm \sqrt{2}$. The eigenvalue that is closest to $a_3^{(0)} = 3$ is $2 + \sqrt{2} \approx 3.4142$, so we take this value for the first shift. Thus, $\sigma = 3.4142$ and $\Sigma = 3.4142$. Factoring the matrix $A^{(0)} - \sigma I$ and computing $RQ$ yields

\[
A^{(1)} = \begin{bmatrix} -1.1755 & 3.4850 & 0 \\ 3.4850 & -0.7376 & -0.09673 \\ 0 & -0.09673 & -0.3296 \end{bmatrix}.
\]

Since $|b_2^{(1)}| >$ TOL, we continue to work with the entire matrix.

For the second iteration, $\sigma = -0.3078$. Thus $\Sigma = 3.1064$. Factoring $A^{(1)} - \sigma I$ and computing $RQ$ yields

\[
A^{(2)} = \begin{bmatrix} -2.0894 & 3.1822 & 0 \\ 3.1822 & 0.7926 & -0.0006623 \\ 0 & -0.0006623 & -0.02244 \end{bmatrix}.
\]

Since $|b_2^{(2)}| >$ TOL, we work with the entire matrix for yet another iteration. In the third iteration, we find $\sigma = -0.02244$. Thus $\Sigma = 3.08397$. Factoring $A^{(2)} - \sigma I$ and computing $RQ$ now yields

\[
A^{(3)} = \begin{bmatrix} -2.9474 & 2.6102 & 0 \\ 2.6102 & 1.6955 & 9.817 \times 10^{-11} \\ 0 & 9.817 \times 10^{-11} & 4.6138 \times 10^{-7} \end{bmatrix}.
\]

At this point, note that $|b_2^{(3)}| <$ TOL, so we accept $4.6138 \times 10^{-7}$ as an eigenvalue of $A^{(3)}$. This implies that $\lambda_3 = 3.08397 + 4.6138 \times 10^{-7} = 3.08397$ is an eigenvalue of $A^{(0)}$.

Having determined one eigenvalue, the fourth iteration works on only the first two rows and columns of $A^{(3)}$. Therefore, the shift is chosen as the eigenvalue of

\[
\begin{bmatrix} a_1^{(3)} & b_1^{(3)} \\ b_1^{(3)} & a_2^{(3)} \end{bmatrix} = \begin{bmatrix} -2.9474 & 2.6102 \\ 2.6102 & 1.6955 \end{bmatrix}
\]

that is closest in value to $a_2^{(3)} = 1.6955$. This gives $\sigma = 2.8673$, and then $\Sigma = 5.9513$. The matrix $A^{(4)}$ is found to be

\[
A^{(4)} = \begin{bmatrix} -6.9865 & 0 & 0 \\ 0 & 0 & 9.817 \times 10^{-11} \\ 0 & 9.817 \times 10^{-11} & 4.6138 \times 10^{-7} \end{bmatrix}.
\]

Since $|b_1^{(4)}| <$ TOL, we accept $a_2^{(4)} = 0$ as an eigenvalue of $A^{(4)}$, which produces the second eigenvalue of $A^{(0)}$: $\lambda_2 = 0 + 5.9513 = 5.9513$. With only one eigenvalue left to be found, it follows that $\lambda_1 = -6.9865 + 5.9513 = -1.0352$.

It is interesting to note that had we required the more restrictive convergence tolerance of TOL $= 5 \times 10^{-14}$, just one more iteration of the QR algorithm would have been needed.


Determining Eigenvectors 


With the QR algorithm, we can compute the eigenvalues and eigenvectors of a symmetric tridiagonal matrix simultaneously. Let $V$ be the matrix of eigenvectors, with the $j$th column being the eigenvector associated with $\lambda_j$. If $A^{(0)}$ was obtained as the result of reducing a symmetric matrix to symmetric tridiagonal form, then $V$ should be initialized to the matrix of eigenvector information produced by the reduction algorithm; otherwise, $V$ should be initialized to the identity matrix. Within the pseudocode for the QR algorithm, after factoring $A^{(i)}$ but before performing the convergence check, replace $V$ by

\[
V P_{(1,2)}^T P_{(2,3)}^T P_{(3,4)}^T \cdots P_{(last-1,last)}^T,
\]

repeatedly making use of equations (2) and (3) with the appropriate sine and cosine values. This will accumulate the effect of each of the orthogonal matrices $Q^{(i)}$. The justification for this equation follows the same argument as was presented in Section 4.4 regarding the eigenvectors of two matrices related via a similarity transformation.
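
The update of $V$ touches only two columns per rotation, so it can be sketched without forming any rotation matrix explicitly. The function name `accumulate_rotations` is an assumption made for illustration. The cosine and sine values used below are the first-iteration values quoted in Example 4.12.

```python
import numpy as np

def accumulate_rotations(V, c, s):
    """Replace V by V P_(1,2)^T P_(2,3)^T ... using relations (2) and (3)."""
    V = np.array(V, dtype=float)
    for j in range(len(c)):
        col_j = V[:, j].copy()
        col_next = V[:, j + 1].copy()
        V[:, j] = c[j] * col_j + s[j] * col_next        # relation (2)
        V[:, j + 1] = -s[j] * col_j + c[j] * col_next   # relation (3)
    return V

# V starts as the identity because the matrix is already tridiagonal.
V1 = accumulate_rotations(np.eye(3),
                          [0.191643, -0.959524],   # c_1, c_2
                          [0.981465, -0.281628])   # s_1, s_2
```

Each rotation is orthogonal, so the columns of `V1` remain orthonormal up to the rounding in the quoted six-digit values.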


EXAMPLE 4.12 Determining Eigenvectors


Since the matrix

\[
A = \begin{bmatrix} 4 & 3 & 0 \\ 3 & 1 & -1 \\ 0 & -1 & 3 \end{bmatrix}
\]

is originally in symmetric tridiagonal form, we initialize $V$ to the identity matrix. During the first iteration of the QR algorithm, the sine and cosine values associated with the rotation matrices $P_{(1,2)}$ and $P_{(2,3)}$ are

\[
c_1 = 0.191643, \quad c_2 = -0.959524, \quad s_1 = 0.981465, \quad \text{and} \quad s_2 = -0.281628.
\]

Hence, we replace $V$ by

\[
V P_{(1,2)}^T P_{(2,3)}^T =
\begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} 0.191643 & -0.981465 & 0 \\ 0.981465 & 0.191643 & 0 \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} 1 & 0 & 0 \\ 0 & -0.959524 & 0.281628 \\ 0 & -0.281628 & -0.959524 \end{bmatrix}
\]

\[
= \begin{bmatrix} 0.191643 & 0.941739 & -0.276408 \\ 0.981465 & -0.183886 & 0.053972 \\ 0 & -0.281628 & -0.959524 \end{bmatrix}.
\]

During the second iteration, the sine and cosine values associated with the rotation matrices $P_{(1,2)}$ and $P_{(2,3)}$ are

\[
c_1 = -0.241604, \quad c_2 = -0.999565, \quad s_1 = 0.970375, \quad \text{and} \quad s_2 = -0.029498.
\]

The eigenvector matrix $V$ is therefore replaced by

\[
V P_{(1,2)}^T P_{(2,3)}^T = V
\begin{bmatrix} -0.241604 & -0.970375 & 0 \\ 0.970375 & -0.241604 & 0 \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} 1 & 0 & 0 \\ 0 & -0.999565 & 0.029498 \\ 0 & -0.029498 & -0.999565 \end{bmatrix}
\]

\[
= \begin{bmatrix} 0.867538 & 0.421467 & 0.264091 \\ -0.415564 & 0.905974 & -0.080731 \\ -0.273285 & -0.039709 & 0.961113 \end{bmatrix}.
\]

Following the third iteration

\[
V = \begin{bmatrix} -0.119119 & 0.957057 & -0.264294 \\ 0.986132 & 0.145023 & 0.080700 \\ 0.115564 & -0.251016 & -0.961060 \end{bmatrix}
\]

and after the final iteration

\[
V = \begin{bmatrix} 0.500622 & -0.824334 & -0.264294 \\ -0.840249 & -0.536162 & 0.080700 \\ -0.208228 & 0.181672 & -0.961060 \end{bmatrix}.
\]

Thus, the matrix

\[
A = \begin{bmatrix} 4 & 3 & 0 \\ 3 & 1 & -1 \\ 0 & -1 & 3 \end{bmatrix}
\]

has eigenpairs

\[
(-1.0352, \; [\, 0.500622 \;\; -0.840249 \;\; -0.208228 \,]^T),
\]

\[
(5.9513, \; [\, -0.824334 \;\; -0.536162 \;\; 0.181672 \,]^T),
\]

\[
(3.0840, \; [\, -0.264294 \;\; 0.080700 \;\; -0.961060 \,]^T),
\]

where the eigenvectors are normalized to unit length in the $l_2$-norm.

EXERCISES


1. Let $A$ be a symmetric matrix. Prove that the matrices $A^{(i)}$ produced by the QR algorithm are symmetric for all $i$.


In Exercises 2-5, perform one iteration of the QR algorithm with Wilkinson shift on 
the indicated matrix. 


3.2 0 4 -2 0 
2,A=/] 2 —5 a wan | 6 ‘| 
0-1 4 0 » -=3 
12 ~#«1 0 7 2 #0 
wa[ 3 2| san | -8 1 | 
0 -2 5 0 1 12 


In Exercises 6-11, determine all of the eigenvalues of the indicated matrix using the 
QR algorithm with Wilkinson shift. Use a convergence tolerance of 5 x 107° if working 
in single precision and 5 x 10-** if working in double precision. Record the number 
of iterations needed to obtain the first eigenvalue and the total number of iterations 
needed to find all of the eigenvalues. 


7 -2 
23 «5 
6. A= 5 -6 1 
12 
L 2-4] 
aca 
126 
1 —1 
7 A= -1 4 2 
2 7 =i 
-1 6 
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10. 


11. 


Chapter 4 
5 
2 
A oe 
4 
-1 
A = 


In Exercises 12-18, 
each matrix. 


13. 


14, 


16. 


17. 


12 
ail 
1 
0 
3 


Eigenvalues and Eigenvectors 


compute all of the eigenvalues and corresponding eigenvectors for 


-l1 21 0 3 
3 0 2 O 
Oo -6 2 1 
2 2 9 0 
0 1 0 ~2 
1 
-8 -I 
-1 2 -1 
m1; 409) 4 
1 12 
1 
12. -2 
—2 8 4 
4 12 -8 
-3 18 
2 -1 -3 —-6 
18 —2 5 4 
-2 20 -5 —4 
~-5 28 1 
4 -4 1 12 
-l1 0 -~-1 0 =1 
2 -1 0 0 0 
-~-1 2 -1 0 0 
0 -i 3 -1 O 
0 oO -1 2 -l 
0 oO O -1 2 
2 
3.4 
4 5 6 
6 2 Tt 
-1 8 -2 
—2 9 
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62 6 15 1 5 
6 16 0 -13 4 -9 ~2 
1 O 658 --3 8 
18. A= 1 -13 -3 30 1 
5 64 8 1 2 3 5 
3. 9 1 -7 3 28 —-1 
5 -2 6 -3 5 -l 44 
19. According to Hückel theory (see Cvetkovic, Doob, and Sachs [6, Chapter 8] or Longuet-Higgins [7]), the energy levels and wave functions of hydrocarbon molecules are related to the eigenvalues and eigenvectors of the adjacency matrix of the graph that represents the carbon skeleton of the molecule. The carbon skeletons for three hydrocarbons are shown below.
(a) Compute all of the eigenvalues and eigenvectors of the adjacency matrix associated with isopropylbenzene.
(b) Repeat part (a) for 4-ethyl, 2-methylhexane.
(c) Repeat part (a) for isobutylcyclopentane.


[Figure: carbon skeletons of isopropylbenzene; 4-ethyl, 2-methylhexane; and isobutylcyclopentane]


20. In simulating the nuclear magnetic resonance (NMR) spectra of three magnetically inequivalent protons, the transition frequencies and intensities are related to the eigenvalues and eigenvectors of the spin Hamiltonian matrix

\[
\begin{bmatrix}
s_{11} & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & s_{22} & J_{23}/2 & J_{13}/2 & 0 & 0 & 0 & 0 \\
0 & J_{23}/2 & s_{33} & J_{12}/2 & 0 & 0 & 0 & 0 \\
0 & J_{13}/2 & J_{12}/2 & s_{44} & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & s_{55} & J_{12}/2 & J_{13}/2 & 0 \\
0 & 0 & 0 & 0 & J_{12}/2 & s_{66} & J_{23}/2 & 0 \\
0 & 0 & 0 & 0 & J_{13}/2 & J_{23}/2 & s_{77} & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & s_{88}
\end{bmatrix}.
\]

The diagonal entries are given by the formulas

\[
\begin{aligned}
s_{11} &= (\nu_1 + \nu_2 + \nu_3)/2 + (J_{12} + J_{13} + J_{23})/4 \\
s_{22} &= (\nu_1 + \nu_2 - \nu_3)/2 + (J_{12} - J_{13} - J_{23})/4 \\
s_{33} &= (\nu_1 - \nu_2 + \nu_3)/2 + (-J_{12} + J_{13} - J_{23})/4 \\
s_{44} &= (-\nu_1 + \nu_2 + \nu_3)/2 + (-J_{12} - J_{13} + J_{23})/4 \\
s_{55} &= (\nu_1 - \nu_2 - \nu_3)/2 + (-J_{12} - J_{13} + J_{23})/4 \\
s_{66} &= (-\nu_1 + \nu_2 - \nu_3)/2 + (-J_{12} + J_{13} - J_{23})/4 \\
s_{77} &= (-\nu_1 - \nu_2 + \nu_3)/2 + (J_{12} - J_{13} - J_{23})/4 \\
s_{88} &= (-\nu_1 - \nu_2 - \nu_3)/2 + (J_{12} + J_{13} + J_{23})/4.
\end{aligned}
\]

The parameters in this system are three chemical shifts: $\nu_1$, $\nu_2$, and $\nu_3$; and three coupling constants: $J_{12}$, $J_{13}$, and $J_{23}$. Take

\[
\nu_1 = 342.0 \text{ Hz}, \quad \nu_2 = 364.6 \text{ Hz}, \quad \nu_3 = 372.2 \text{ Hz},
\]
\[
J_{12} = 11.75 \text{ Hz}, \quad J_{13} = 17.90 \text{ Hz}, \quad J_{23} = 0.91 \text{ Hz}
\]

and determine all of the eigenvalues of the spin Hamiltonian matrix and their associated eigenvectors.


CHAPTER 5 


Interpolation (and Curve Fitting)


AN OVERVIEW
Building a Table of Logarithms


A publisher of mathematics textbooks needs a table of values for the common logarithm function (i.e., the base 10 logarithm) for one of its new precalculus textbooks. Each entry in the table is to be accurate to six (6) decimal places. The table must include entries for uniformly spaced values of $x$ ranging from 1.0 to 10.0, and the increment between $x$ values must be small enough so that linear interpolation between any two entries in the table introduces an error of less than $10^{-6}$. What is the maximum possible increment that can be used in the construction of this table?


Properties of Water 


Table A.5 in Frank White, Fluid Mechanics, lists the following values for the surface tension, $\Upsilon$, vapor pressure, $p_v$, and sound speed, $a$, for water as a function of temperature. Based on these values, what are the surface tension, vapor pressure and sound speed for water when $T = 34^\circ$C, $68^\circ$C, $86^\circ$C, and $91^\circ$C?


T (°C)    Υ (N/m)    p_v (kPa)    a (m/s)

   0      0.0756       0.611       1402
  10      0.0742       1.227       1447
  20      0.0728       2.337       1482
  30      0.0712       4.242       1508
  40      0.0696       7.375       1529
  50      0.0679      12.34        1542
  60      0.0662      19.92        1551
  70      0.0644      31.16        1553
  80      0.0626      47.35        1554
  90      0.0608      70.11        1550
 100      0.0589     101.3         1543


Probability of a Shutout in Racquetball 


The following table (drawn from Joseph Keller, "Probability of a Shutout in Racquetball," SIAM Review, 26, 267-8, 1984) gives the probability, $P$, that a given player will shut out an opponent as a function of the probability that the player will win any particular rally, $p$, regardless of who serves.


p    1.0    0.9      0.85     0.842    0.84     0.5
P    1.0    0.753    0.534    0.500    0.490    0.0001504


Suppose that a player estimates a 60% chance of winning any particular rally against a given opponent, regardless of who serves. What is the probability that this player will shut out the given opponent? With what probability must a player win any particular rally in order to have a 25% chance of shutting out an opponent?


Data Analysis for the Spread of an Epidemic 


Suppose that a mathematical model for the spread of an epidemic produces the following estimates for the number of people who have died as a result of the epidemic, $D(t)$, and the rate at which people are dying, $D'(t)$. Here, time is measured in weeks. Using this data, we wish to generate a table which shows the number of dead at half-week increments.


    t             D(t)            D'(t)
0.000000        0.000000      600.000000
0.750000      445.903683      573.579644
1.500000      842.695315      477.074216
2.085600     1095.211197      384.947629
2.676193     1295.955674      296.576145
3.219694     1437.602773      226.796410
3.748513     1542.363644      171.475176
4.279179     1621.280769      127.808738
4.821254     1680.890649       93.728061
5.000000     1696.803710       84.473801


Fundamental Mathematical Problem 


Each of the examples just presented illustrates the fundamental mathematical problem to be treated in this chapter:


    Given a set of points $(x_i, f_i)$ for $i = 0, 1, 2, \ldots, n$, where the $x_i$ are distinct values of the independent variable and the $f_i$ are corresponding values of some function $f$, either

        approximate the value of $f$ at some value of $x$ not listed among the $x_i$

    or

        determine a function $g$ that in some sense approximates the data.


Problem data can also include derivative values in addition to function values. In some instances, the problem data will be specified as the function $f$ itself, rather than as a discrete set of points from the graph of $f$. In these cases, a function $g$ that is less expensive to evaluate and/or easier to manipulate is sought.

The mathematical problem stated above actually gives rise to two different areas of study: interpolation and approximation. In interpolation, the function $g$ is determined by requiring the error to be zero at each of the $x_i$; that is, by enforcing the conditions $g(x_i) = f_i$ for each $i = 0, 1, 2, \ldots, n$. Hence, interpolation treats error in a somewhat local fashion. Approximation, on the other hand, treats error in a more global manner, requiring that some measure of error, such as the sum of the square of the difference between $g(x_i)$ and $f_i$, be minimized. Although a final section on least squares regression (one possible approximation technique) is provided for completeness, the objective of this chapter will be to discuss interpolation.
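
The distinction between the two approaches can be seen on a tiny data set. The sketch below is illustrative only: the data values are made up for this example, and NumPy's `polyfit` is used as a convenient stand-in for the techniques developed later in the chapter.

```python
import numpy as np

# Hypothetical data: four points (x_i, f_i), chosen only for illustration.
x = np.array([0.0, 1.0, 2.0, 3.0])
f = np.array([1.0, 2.0, 0.0, 5.0])

# Interpolation: a cubic through four points enforces g(x_i) = f_i exactly.
g_interp = np.polyfit(x, f, 3)
interp_residual = np.polyval(g_interp, x) - f

# Approximation: a least-squares line minimizes the sum of squared differences.
g_line = np.polyfit(x, f, 1)
line_residual = np.polyval(g_line, x) - f
```

The interpolant has zero residual at every data point, up to rounding. The least-squares line generally misses every point but keeps the total squared error as small as possible.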

In addition to the variety of applications, such as those given above, for which interpolation is useful, interpolation is also a major tool for the development of other numerical techniques. Most of the algorithms developed in Chapter 6 (Differentiation and Integration) have their basis in interpolation. The same is true for many of the techniques which will be developed in Chapter 7 (Solution of Initial Value Problems).

There are many different types of interpolation, depending upon the class of 
functions from which g is selected. The most common forms of interpolation are 


1. polynomial interpolation 

2. piecewise polynomial (spline) interpolation 

3. rational interpolation 

4. trigonometric interpolation

5. exponential interpolation

The focus of this chapter will be on polynomial and piecewise polynomial interpolation.

There are three major reasons for focusing on interpolation by polynomials and piecewise polynomials. First, as we've seen in Section 2.7 (Polynomial Rootfinding), the polynomial

\[
P_n(x) = a_0 + a_1 x + a_2 x^2 + \cdots + a_n x^n
\]

can be evaluated very efficiently using the synthetic division algorithm

    value := a_n
    for i from n - 1 downto 0 do
        value := a_i + x * value

Second, derivatives and integrals of polynomials are easy to compute and are still polynomials. Third, polynomials satisfy what is known as the uniform approximation property. This is embodied by the following theorem due to Weierstrass, a proof of which may be found in Rivlin [1].

Theorem (Weierstrass Approximation Theorem). Let $f$ be continuous on the closed interval $[a,b]$. Given any $\epsilon > 0$, there exists a polynomial $p$ such that

\[
\|f - p\|_\infty = \max_{a \le x \le b} |f(x) - p(x)| < \epsilon.
\]


Everyone who has taken a two-semester sequence in calculus is already familiar 
with one interpolating polynomial, though it was not referred to as such in calculus. 
This interpolating polynomial in disguise is none other than the Taylor polynomial: 


\[ f(x) \approx P_n(x) = \sum_{k=0}^{n} \frac{f^{(k)}(x_0)}{k!} (x - x_0)^k. \]


This is an interpolating polynomial in the sense that it matches the function and the
first n derivative values at the location $x = x_0$. Since this polynomial uses a lot of
local information about the function f, it is good for making local approximations. 
The error associated with the Taylor polynomial is given by 


\[ f(x) - P_n(x) = \frac{f^{(n+1)}(\xi)}{(n+1)!} (x - x_0)^{n+1}, \]

where $\xi$ is between $x$ and $x_0$. The error is therefore bounded by

\[ \max |f^{(n+1)}(t)| \, \frac{|x - x_0|^{n+1}}{(n+1)!}. \]

Note that this bound can be large in two different ways: if $\max |f^{(n+1)}(t)|$ is large
or if $x$ is far from $x_0$. Nothing can really be done to improve results in the former
case, but in the latter case, a different polynomial, one that uses information about 
f at many different locations, can be used. 
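
To make the growth of the Taylor error with distance from the expansion point concrete, here is a minimal Python sketch. The choice of f(x) = e^x, the expansion point 0, the degree 4, and the helper names `taylor_exp` and `taylor_bound` are all assumptions made for illustration.

```python
import math


def taylor_exp(x, n, x0=0.0):
    """Degree-n Taylor polynomial of exp about x0, evaluated at x."""
    # Every derivative of exp at x0 equals exp(x0).
    return sum(math.exp(x0) * (x - x0) ** k / math.factorial(k)
               for k in range(n + 1))


def taylor_bound(x, n, x0=0.0):
    """Bound max|f^(n+1)| * |x - x0|**(n+1) / (n+1)! for f = exp."""
    # exp is increasing, so its maximum between x0 and x is at the larger endpoint.
    max_deriv = math.exp(max(x, x0))
    return max_deriv * abs(x - x0) ** (n + 1) / math.factorial(n + 1)


for x in (0.5, 2.0):
    error = abs(math.exp(x) - taylor_exp(x, 4))
    print(x, error, taylor_bound(x, 4))
```

The printed error stays below the bound at both points, and both the error and the bound are much larger at x = 2 than at x = 0.5.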


Remainder of the Chapter 


The remainder of this chapter is organized as follows. Section 1 introduces the 
Lagrange form of the interpolating polynomial and presents the error bound asso- 
ciated with polynomial interpolation. The next two sections provide answers for 
two important questions associated with polynomial interpolation. Neville’s algo- 
rithm, presented in Section 2, provides the most efficient means of determining the 
value of the interpolating polynomial at a single value of the independent variable. 
Divided differences and the Newton form are covered in Section 3. These provide 
the most efficient way to explicitly obtain the interpolating polynomial and evaluate
it for many x-values. The optimal choice of the interpolating points, $x_i$, is
discussed in Section 4, and the next two sections introduce piecewise polynomial
interpolation. Piecewise linear interpolation is treated in Section 5, while cubic spline
interpolation is the subject of Section 6. Section 7 discusses Hermite interpolation,
which arises when both the value of the function and its first derivative are known
at each $x_i$. The chapter concludes with a section on least squares regression.

5.1 LAGRANGE FORM OF THE INTERPOLATING POLYNOMIAL


Let $x_0, x_1, x_2, \ldots, x_n$ be n + 1 distinct, though not necessarily uniformly spaced,
points along the real line, and let $f_i$ ($i = 0, 1, 2, \ldots, n$) denote the function value
associated with the point $x_i$. The $x_i$ may be referred to as abscissas, nodes, or
interpolating points. At this time, problem data will consist of function values
only; derivative values will be introduced in Section 5.7. A polynomial, $P_n$, of
degree at most n that satisfies $P_n(x_i) = f_i$ for each $i = 0, 1, 2, \ldots, n$ is sought.
The fundamental concepts of this basic polynomial interpolation problem will be 
discussed in this section. 


Linear Interpolation 


Let’s start with the simplest case, that of linear interpolation. The data in this 
case consists of two abscissas, $x_0$ and $x_1$, and two corresponding function values, $f_0$
and $f_1$. The objective is to find a linear polynomial, $P_1(x) = a_0 + a_1 x$, such that

\[ P_1(x_0) = a_0 + a_1 x_0 = f_0 \quad \text{and} \quad P_1(x_1) = a_0 + a_1 x_1 = f_1. \]

The solution of these interpolating conditions is easily found to be

\[ a_1 = \frac{f_1 - f_0}{x_1 - x_0} \quad \text{and} \quad a_0 = \frac{x_1 f_0 - x_0 f_1}{x_1 - x_0}, \]

so that

\[ P_1(x) = \frac{x_1 f_0 - x_0 f_1}{x_1 - x_0} + \frac{f_1 - f_0}{x_1 - x_0}\, x. \]

The formula for $P_1$ can be rearranged into the form

\[ P_1(x) = \frac{x - x_1}{x_0 - x_1} f_0 + \frac{x - x_0}{x_1 - x_0} f_1. \]


This formula not only clearly identifies and distinguishes the dependence of the 
interpolating polynomial on the function values from the dependence on the inter- 
polating points, it will also make the generalization to higher-degree interpolating 
polynomials, to be undertaken momentarily, much easier. 


EXAMPLE 5.1 Linear Interpolation from Thermodynamic Tables 


A thermodynamics student needs to determine whether Freon-12 under a pressure 
of P = 400 kilopascals (kPa) and with a specific volume (volume per unit mass) 
of v = 0.042 m3/kg is in a saturated or superheated state. The answer to this 
question depends upon how the specific volume of v = 0.042 m3/kg compares with 
the specific volume of saturated Freon-12 vapor, $v_g$, at a pressure of 400 kPa. If the
given specific volume is below $v_g$ then the Freon-12 is in a saturated state; otherwise
it is in a superheated state.

The available thermodynamic tables (Table A.2.3 of Fundamentals of Classical 
Thermodynamics by Van Wylen and Sonntag) provide the following values for the 
specific volume of saturated Freon-12 vapor as a function of pressure. 


Pressure (kPa)    $v_g$ (m^3/kg)
    362.6           0.047485
    423.3           0.040914


Linear interpolation can be used to approximate the needed specific volume. The 
linear interpolating polynomial based on the values given in the table is 


\[ v_g(P) = \frac{P - 423.3}{362.6 - 423.3} \cdot 0.047485 + \frac{P - 362.6}{423.3 - 362.6} \cdot 0.040914. \]

Evaluating this polynomial at P = 400 kPa gives

\[ v_g(400) = \frac{-23.3}{362.6 - 423.3} \cdot 0.047485 + \frac{37.4}{423.3 - 362.6} \cdot 0.040914 = 0.043436, \]

which is larger than the specified value for the specific volume. Hence, Freon-12
under a pressure of P = 400 kPa and with a specific volume of v = 0.042 m^3/kg is
in a saturated state.
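
The calculation in this example can be checked with a few lines of Python. This is a minimal sketch: the function name `linear_interp` and the variable names are assumptions, and the data are the two table entries quoted in the example.

```python
def linear_interp(x0, f0, x1, f1, x):
    """Two-point Lagrange form: weights depend only on the abscissas."""
    return (x - x1) / (x0 - x1) * f0 + (x - x0) / (x1 - x0) * f1


# Saturated Freon-12 vapor: (pressure in kPa, specific volume in m^3/kg)
v_g = linear_interp(362.6, 0.047485, 423.3, 0.040914, 400.0)
print(v_g)

# The stated specific volume is compared against the interpolated v_g.
print("saturated" if 0.042 < v_g else "superheated")
```

Because the weights multiplying the two function values depend only on the pressures, the same weights could be reused for any other property tabulated at these two pressures.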


Interpolation by Higher-Degree Polynomials


If more than two data points are available, a higher-degree interpolating polynomial 
can be computed. Suppose that n + 1 data points are available. Each data point
translates into a single interpolation condition; therefore, with n + 1 points, there
will be n + 1 interpolation conditions, which will allow for the determination of
n + 1 polynomial coefficients. Since an nth-degree polynomial has n + 1 coefficients
(one for each power of the independent variable, plus the constant term), it follows
that n + 1 data points can determine a polynomial of degree at most n.

To determine this polynomial of degree at most n, one could apply the in- 
terpolating conditions to produce a system of linear equations for the coefficients 
of the polynomial. Though a natural process to attempt, this approach is very
cumbersome and time consuming. A more efficient scheme for obtaining the interpolating
polynomial can be uncovered by making a close examination of the final
formula given above for the linear interpolating polynomial. In particular, pay close 
attention to the coefficients of the function values. 


\[ P_1(x) = \frac{x - x_1}{x_0 - x_1} f_0 + \frac{x - x_0}{x_1 - x_0} f_1 \]

Coefficient of $f_0$: polynomial of degree one; value = 1 at $x = x_0$; value = 0 at $x = x_1$.
Coefficient of $f_1$: polynomial of degree one; value = 0 at $x = x_0$; value = 1 at $x = x_1$.

Note that these coefficients are polynomials of the same degree as the overall interpolating
polynomial. Furthermore, the coefficient of $f_0$ evaluates to 1 at $x = x_0$,
the abscissa associated with the function value $f_0$, and evaluates to zero at the
other abscissa. A similar result holds for the coefficient of $f_1$: the value is 1 at
the abscissa associated with $f_1$, but is zero at the other abscissa. These coefficient
polynomials are called Lagrange polynomials and are denoted by

\[ L_{1,0}(x) = \frac{x - x_1}{x_0 - x_1} \quad \text{and} \quad L_{1,1}(x) = \frac{x - x_0}{x_1 - x_0}. \]

The first subscript indicates the degree of the polynomial, while the second indicates
the associated interpolating point. With this notation, $P_1$ can be expressed very
compactly as

\[ P_1(x) = L_{1,0}(x) f_0 + L_{1,1}(x) f_1 = \sum_{i=0}^{1} L_{1,i}(x) f_i. \]


The simplicity of this representation suggests obtaining higher degree inter- 
polating polynomials by generalizing the notion of a Lagrange polynomial. 


Definition. The LAGRANGE POLYNOMIAL $L_{n,j}(x)$ has degree n and is associated
with the interpolating point $x_j$ in the sense

\[ L_{n,j}(x_i) = \begin{cases} 1, & i = j \\ 0, & i \ne j. \end{cases} \]


With this family of functions, it is straightforward to demonstrate that

\[ P_n(x) = \sum_{i=0}^{n} L_{n,i}(x) f_i \]

interpolates the data $(x_i, f_i)$ for $i = 0, 1, 2, \ldots, n$. For each $x_j$,

\[ P_n(x_j) = \sum_{i=0}^{n} L_{n,i}(x_j) f_i
   = 0 \cdot f_0 + \cdots + 0 \cdot f_{j-1} + 1 \cdot f_j + 0 \cdot f_{j+1} + \cdots + 0 \cdot f_n
   = f_j, \]

since $L_{n,i}(x_j)$ is 1 if $i = j$ and 0 otherwise.

Since $P_n(x) = \sum_{i=0}^{n} L_{n,i}(x) f_i$ is based on Lagrange polynomials, $P_n$ is referred to
as the Lagrange Form of the Interpolating Polynomial.

The final piece needed to construct the Lagrange form of the interpolating 
polynomial is to obtain explicit formulas for the $L_{n,j}$. Fortunately, these can be
determined by directly applying the conditions stated in the definition. Since $L_{n,j}$
is an nth-degree polynomial with n roots located at $x = x_i$ ($i \ne j$), it follows that
$L_{n,j}$ must be of the form

\[ c(x - x_0)(x - x_1) \cdots (x - x_{j-1})(x - x_{j+1}) \cdots (x - x_n) \]

Figure 5.1 Two of the Lagrange polynomials defined by the sequence
of seven interpolating points $x_0 = 0.0$, $x_1 = 1.6$, $x_2 = 3.8$, $x_3 = 4.5$,
$x_4 = 6.3$, $x_5 = 9.2$, $x_6 = 10.0$. The circles are located along the x-axis
at the locations of the interpolating points.


EXAMPLE 5.3 Interpolation from Thermodynamic Tables, Revisited 


Reconsider the problem of determining the specific volume of Freon-12 vapor under 
a pressure of 400 kPa, this time using a higher degree interpolating polynomial. 
The third-degree polynomial based on the four points 


Pressure (kPa)     308.6      362.6      423.3      491.4
$v_g$ (m^3/kg)     0.055389   0.047485   0.040914   0.035413


is given by

\[ \begin{aligned}
v_g(P) = {} & \frac{(P - 362.6)(P - 423.3)(P - 491.4)}{(-54)(-114.7)(-182.8)} \cdot 0.055389 \\
 & + \frac{(P - 308.6)(P - 423.3)(P - 491.4)}{(54)(-60.7)(-128.8)} \cdot 0.047485 \\
 & + \frac{(P - 308.6)(P - 362.6)(P - 491.4)}{(114.7)(60.7)(-68.1)} \cdot 0.040914 \\
 & + \frac{(P - 308.6)(P - 362.6)(P - 423.3)}{(182.8)(128.8)(68.1)} \cdot 0.035413.
\end{aligned} \]

for some constant c. The final condition of the definition, $L_{n,j}(x_j) = 1$, determines
the value of c:

\[ c = \frac{1}{(x_j - x_0)(x_j - x_1) \cdots (x_j - x_{j-1})(x_j - x_{j+1}) \cdots (x_j - x_n)}. \]

Therefore,

\[ L_{n,j}(x) = \frac{(x - x_0)(x - x_1) \cdots (x - x_{j-1})(x - x_{j+1}) \cdots (x - x_n)}
                     {(x_j - x_0)(x_j - x_1) \cdots (x_j - x_{j-1})(x_j - x_{j+1}) \cdots (x_j - x_n)}
             = \prod_{i=0,\, i \ne j}^{n} \frac{x - x_i}{x_j - x_i}. \]

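
The product formula for the Lagrange polynomials translates almost directly into code. The following Python sketch is one straightforward way to do it; the names `lagrange_basis` and `lagrange_interpolate` are assumptions, and the test data are the four Freon-12 table entries quoted in this section.

```python
def lagrange_basis(xs, j, x):
    """Value at x of the Lagrange polynomial associated with node xs[j]."""
    value = 1.0
    for i, xi in enumerate(xs):
        if i != j:
            value *= (x - xi) / (xs[j] - xi)
    return value


def lagrange_interpolate(xs, fs, x):
    """Value at x of the interpolating polynomial in Lagrange form."""
    return sum(lagrange_basis(xs, j, x) * fj for j, fj in enumerate(fs))


# Saturated Freon-12 vapor: pressure (kPa) and specific volume (m^3/kg)
pressures = [308.6, 362.6, 423.3, 491.4]
volumes = [0.055389, 0.047485, 0.040914, 0.035413]
print(lagrange_interpolate(pressures, volumes, 400.0))

# Each basis polynomial is 1 at its own node and 0 at every other node.
print([lagrange_basis(pressures, 1, p) for p in pressures])
```

Each evaluation recomputes every basis polynomial from scratch, which is simple but costs on the order of n^2 operations per point.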
EXAMPLE 5.2 Lagrange Polynomials 


Consider the following seven interpolating points: 
$x_0 = 0.0$, $x_1 = 1.6$, $x_2 = 3.8$, $x_3 = 4.5$, $x_4 = 6.3$, $x_5 = 9.2$, $x_6 = 10.0$.
Based on these points, two of the Lagrange polynomials are

\[ L_{6,1}(x) = \frac{x(x - 3.8)(x - 4.5)(x - 6.3)(x - 9.2)(x - 10.0)}
                     {1.6(1.6 - 3.8)(1.6 - 4.5)(1.6 - 6.3)(1.6 - 9.2)(1.6 - 10.0)} \]

and

\[ L_{6,3}(x) = \frac{x(x - 1.6)(x - 3.8)(x - 6.3)(x - 9.2)(x - 10.0)}
                     {4.5(4.5 - 1.6)(4.5 - 3.8)(4.5 - 6.3)(4.5 - 9.2)(4.5 - 10.0)}. \]


These two polynomials are plotted in Figure 5.1. The circles are located along the
x-axis at the positions of the seven interpolating points.


Note the large amplitude oscillations present in Figure 5.1 in $L_{6,3}$. This
type of behavior is typical with high degree polynomials and tends to get worse 
as the degree of the polynomial is increased. Because of this behavior, whenever 
high degree polynomials are used for interpolation, some sort of consistency check 
needs to be performed. This could involve simply plotting the data values and the 
interpolating polynomial on the same graph. This would allow a visual verification 
of the extent to which the behavior of the polynomial matches that of the underlying 
data. As an alternative approach, several of the data points could be held in reserve, 
that is, not used to compute the interpolating polynomial. The differences between 
the reserved function values and the values of the interpolating polynomial at the 
abscissas of the reserved points would then serve as a measure of interpolation 
accuracy.

Figure 5.3 Eighth-degree interpolating polynomial for emittance of
tungsten as a function of temperature. Each data point used to construct 
the polynomial is denoted by *. 


Uniqueness of the Interpolating Polynomial 


In subsequent sections of this chapter, other forms for the interpolating polynomial 
will be considered. The following theorem shows that given n+1 distinct abscissas 
and the corresponding function values, there is only one polynomial of degree at 
most n which interpolates the data. This polynomial can, of course, be written 
in different ways, each having its own advantages and disadvantages, but they all 
represent exactly the same function. 


Theorem. If $x_0, x_1, x_2, \ldots, x_n$ are n + 1 distinct points and f is defined
at $x_0, x_1, x_2, \ldots, x_n$, then there exists a unique polynomial, P, of degree at
most n such that P interpolates f; that is,

\[ P(x_i) = f(x_i) \]

for each $i = 0, 1, 2, \ldots, n$. P is called the INTERPOLATING POLYNOMIAL.


Proof. (1) Existence 
This part of the proof is easy. Since the points $x_0, x_1, x_2, \ldots, x_n$ are distinct,
the polynomial

\[ P(x) = \sum_{i=0}^{n} L_{n,i}(x) f(x_i) \]


interpolates the data—it is precisely the Lagrange form of the interpolating 
polynomial. Existence has been established. 


3415 Chapter 5 Interpolation (and Curve Fitting) 


Figure 5.2 Third-degree interpolating polynomial for specific volume
as a function of absolute pressure (kPa). Each data point used to construct
the polynomial is denoted by *.

This polynomial, which is plotted in Figure 5.2, clearly provides a plausible 
representation for the data. Using this polynomial, the specific volume of saturated 
Freon-12 vapor at a pressure of 400 kPa is 0.043199 m^3/kg.


EXAMPLE 5.4 Emittance of Tungsten as Function of Temperature 


The table below gives experimental values for the emittance of tungsten as a function
of temperature.


Temperature (K) Emittance Temperature (K) Emittance 
300 0.024 800 0.083 
400 0.035 900 0.097 
500 0.046 1000 0.111 
600 0.058 1100 0.125 
700 0.067 


The eighth-degree polynomial which interpolates this data is plotted in Figure 5.3. 
Note that the behavior of the polynomial, from around 300-500 K and 900-1100 K, 
does not appear to be consistent with the underlying data. The use of a spline 
interpolating function (Section 5.6) would be advisable for this problem.

(2) Uniqueness

This part of the proof will proceed by contradiction. Suppose that P and Q
are different polynomials of degree at most n which interpolate f at the n + 1
distinct points $x_0, x_1, x_2, \ldots, x_n$. Consider the function $h(x) = P(x) - Q(x)$.
Since P and Q are both polynomials of degree at most n, h is also a polynomial
of degree at most n. Furthermore, since P and Q both interpolate the same
data, it follows that

\[ h(x_i) = P(x_i) - Q(x_i) = f_i - f_i = 0 \]

for each $i = 0, 1, 2, \ldots, n$. Therefore, h is a polynomial of degree at most
n with n + 1 roots. The Fundamental Theorem of Algebra guarantees that
the only way this can happen is if $h(x) \equiv 0$. This implies that $P \equiv Q$,
which contradicts our assumption. Hence, the interpolating polynomial is
unique. □


Interpolation Error 


One of the major benefits of the uniqueness theorem is that it allows for a general 
discussion of interpolation error. Since the various forms of the interpolating poly- 
nomial are just different ways of writing the same function, interpolation error does 
not depend on the form selected for the interpolating polynomial, and there is no 
need to treat each form separately. 

Theorem. If $x_0, x_1, x_2, \ldots, x_n$ are n + 1 distinct points in [a,b] and f is
continuous on [a,b] and has n + 1 continuous derivatives on (a,b), then for
each $x \in [a,b]$ there exists a $\xi(x) \in [a,b]$ such that

\[ f(x) = P(x) + \frac{f^{(n+1)}(\xi)}{(n+1)!} (x - x_0)(x - x_1)(x - x_2) \cdots (x - x_n), \]

where P is the interpolating polynomial.


Proof. First note that since $P(x_i) = f(x_i)$ by the interpolation conditions
and since the term involving $f^{(n+1)}$ contains the factor $(x - x_i)$, the error
formula holds for each abscissa, $x = x_i$. For all other $x \in [a,b]$, consider the
auxiliary function

\[ g(t) = f(t) - P(t) - [f(x) - P(x)] \prod_{i=0}^{n} \frac{t - x_i}{x - x_i}. \]

By hypothesis, f has n + 1 continuous derivatives on (a,b). Since P and
$\prod_{i=0}^{n} \frac{t - x_i}{x - x_i}$ are polynomials, they possess infinitely many continuous
derivatives on (a,b). By construction, then, g has n + 1 continuous derivatives on
(a,b). Furthermore,

\[ g(x_j) = f(x_j) - P(x_j) - [f(x) - P(x)] \prod_{i=0}^{n} \frac{x_j - x_i}{x - x_i}
          = 0 - [f(x) - P(x)] \cdot 0 = 0 \]

for each $j = 0, 1, 2, \ldots, n$, and

\[ g(x) = f(x) - P(x) - [f(x) - P(x)] \prod_{i=0}^{n} \frac{x - x_i}{x - x_i}
        = f(x) - P(x) - [f(x) - P(x)] \cdot 1 = 0. \]

g therefore has n + 2 roots on [a,b]. Applying the generalized Rolle's theorem
(see Appendix A), it follows that there exists $\xi(x) \in [a,b]$ such that
$g^{(n+1)}(\xi) = 0$.

Now, P is a polynomial of degree at most n, so $P^{(n+1)}(t) \equiv 0$. On the
other hand, $\prod_{i=0}^{n} \frac{t - x_i}{x - x_i}$ is a polynomial of degree n + 1 with leading coefficient
$\left[ \prod_{i=0}^{n} (x - x_i) \right]^{-1}$, so

\[ \frac{d^{n+1}}{dt^{n+1}} \left[ \prod_{i=0}^{n} \frac{t - x_i}{x - x_i} \right]
   = (n+1)! \cdot \left[ \prod_{i=0}^{n} (x - x_i) \right]^{-1}. \]

Differentiating g n + 1 times and evaluating at $\xi$ then gives

\[ 0 = g^{(n+1)}(\xi) = f^{(n+1)}(\xi) - 0 - [f(x) - P(x)] (n+1)! \left[ \prod_{i=0}^{n} (x - x_i) \right]^{-1}. \]

Solving this equation for f(x) yields the desired error formula:

\[ f(x) = P(x) + \frac{f^{(n+1)}(\xi)}{(n+1)!} (x - x_0)(x - x_1)(x - x_2) \cdots (x - x_n). \] □
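
The error formula can be turned into a computable bound whenever the (n+1)st derivative of f can be bounded on [a,b]. The sketch below compares the actual interpolation error with that bound for an illustrative case chosen here, not taken from the text: f(x) = cos x, nodes 0, 0.5 and 1, evaluated at x = 0.25. The helper name `lagrange_interpolate` is likewise an assumption.

```python
import math


def lagrange_interpolate(xs, fs, x):
    """Value at x of the interpolating polynomial in Lagrange form."""
    total = 0.0
    for j, fj in enumerate(fs):
        weight = 1.0
        for i, xi in enumerate(xs):
            if i != j:
                weight *= (x - xi) / (xs[j] - xi)
        total += weight * fj
    return total


nodes = [0.0, 0.5, 1.0]                 # n = 2, so the bound uses f'''
values = [math.cos(t) for t in nodes]
x = 0.25

actual_error = abs(math.cos(x) - lagrange_interpolate(nodes, values, x))

# f'''(t) = sin(t); on [0, 1] its largest magnitude is sin(1).
max_third_derivative = math.sin(1.0)
node_product = abs((x - nodes[0]) * (x - nodes[1]) * (x - nodes[2]))
bound = max_third_derivative / math.factorial(3) * node_product

print(actual_error, bound)
```

The bound is pessimistic because it replaces the unknown point in the error formula by the worst case over the whole interval, but it requires no knowledge of where that point actually lies.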


Advantages and Disadvantages of the Lagrange Form 


Each form of the interpolating polynomial that will be studied will have its own set 
of advantages and disadvantages. One of the advantages of the Lagrange form of 
the interpolating polynomial is the simplicity of its derivation. Since the Lagrange 
form isolates the dependence of the interpolating polynomial on the function val- 
ues, this form is useful when the abscissas are fixed, but the corresponding function 
values are changed often. The greatest advantage of the Lagrange form, however, 
is its theoretical value. Almost all of the techniques developed in later chapters 
which are based on interpolation start from the Lagrange form of the interpolat- 
ing polynomial. On the negative side, if more data become available, the work 
performed to generate the original Lagrange form cannot be reused to compute a 
higher-degree polynomial. Work must begin from scratch. Finally, the Lagrange 
form of the interpolating polynomial is very cumbersome for common polynomial 
operations such as evaluation, differentiation and integration. 


EXERCISES 
1. Let $x_0 = -1$, $x_1 = 1$ and $x_2 = 2$.

(a) Determine formulas for the Lagrange polynomials $L_{2,0}(x)$, $L_{2,1}(x)$, and
$L_{2,2}(x)$ associated with the given interpolating points.

(b) Plot $L_{2,0}(x)$, $L_{2,1}(x)$, and $L_{2,2}(x)$ on the same set of axes over the range
[-1, 2].


2. Let $x_0 = -3$, $x_1 = 0$, $x_2 = e$ and $x_3 = \pi$.

(a) Determine formulas for the Lagrange polynomials $L_{3,0}(x)$, $L_{3,1}(x)$, $L_{3,2}(x)$
and $L_{3,3}(x)$ associated with the given interpolating points.

(b) Plot $L_{3,0}(x)$, $L_{3,1}(x)$, $L_{3,2}(x)$ and $L_{3,3}(x)$ on the same set of axes over the
range $[-3, \pi]$.


3. Let $x_0 = 0.0$, $x_1 = 1.6$, $x_2 = 3.8$, $x_3 = 4.5$, $x_4 = 6.3$, $x_5 = 9.2$, and $x_6 = 10.0$.

(a) Determine formulas for the Lagrange polynomials $L_{6,0}(x)$, $L_{6,2}(x)$, and
$L_{6,5}(x)$ associated with the given interpolating points.

(b) Plot $L_{6,0}(x)$, $L_{6,2}(x)$, and $L_{6,5}(x)$ on the same set of axes over the range
[0, 10].


4. Consider the function f(x) = ln x.

(a) Construct the Lagrange form of the interpolating polynomial for f passing
through the points (1, ln 1), (2, ln 2), and (3, ln 3).

(b) Plot the polynomial obtained in part (a) on the same set of axes as f(x) =
ln x. Use an x range of [1, 3]. Next, generate a plot of the difference between
the polynomial obtained in part (a) and f(x) = ln x.

(c) Use the polynomial obtained in part (a) to estimate both ln(1.5) and ln(2.4).
What is the error in each approximation?

(d) Establish the theoretical error bound for using the polynomial found in part
(a) to approximate ln(1.5). Compare the theoretical error bound to the error
found in part (c).


5. Consider the function f(x) = sin x.

(a) Construct the Lagrange form of the interpolating polynomial for f passing
through the points $(0, \sin 0)$, $(\pi/4, \sin \pi/4)$, and $(\pi/2, \sin \pi/2)$.

(b) Plot the polynomial obtained in part (a) on the same set of axes as f(x) =
sin x. Use an x range of $[0, \pi/2]$. Next, generate a plot of the difference
between the polynomial obtained in part (a) and f(x) = sin x.

(c) Use the polynomial obtained in part (a) to estimate both $\sin(\pi/3)$ and
$\sin(\pi/6)$. What is the error in each approximation?

(d) Establish the theoretical error bound for using the polynomial found in part
(a) to approximate $\sin(\pi/3)$. Compare the theoretical error bound to the
error found in part (c).


6. Consider the function $f(x) = e^x$.

(a) Construct the Lagrange form of the interpolating polynomial for f passing
through the points $(-1, e^{-1})$, $(0, e^0)$, and $(1, e^1)$.

(b) Plot the polynomial obtained in part (a) on the same set of axes as $f(x) = e^x$.
Use an x range of [-1, 1]. Next, generate a plot of the difference between
the polynomial obtained in part (a) and $f(x) = e^x$.

(c) Use the polynomial obtained in part (a) to estimate both \/e and en, 
What is the error in each approximation? 

(d) Establish the theoretical error bound for using the polynomial found in part
(a) to approximate $\sqrt{e}$. Compare the theoretical error bound to the error
found in part (c).

7. Consider the data set

    x    -1    0    1    2
    y     5    1    1   11
(a) Show that the polynomials $f(x) = x^3 + 2x^2 - 3x + 1$ and
$g(x) = \frac{1}{8}x^4 + \frac{3}{4}x^3 + \frac{15}{8}x^2 - \frac{11}{4}x + 1$ both interpolate all of the data.
(b) Why does this not contradict the uniqueness part of the theorem on existence 
and uniqueness of polynomial interpolation? 
8. Consider the data set

    x    -3     1     2    5
    y   -23   -11   -23    1

(a) Show that the polynomials $f(x) = x^3 - 3x^2 - 10x + 1$ and
$g(x) = -23 + 3(x + 3) - 3(x + 3)(x - 1) + (x + 3)(x - 1)(x - 2)$ both interpolate all of the
data.
(b) Why does this not contradict the uniqueness part of the theorem on existence
and uniqueness of polynomial interpolation?


9. Suppose that f is continuous and has continuous first and second derivatives
on the interval $[x_0, x_1]$. Derive the following bound on the error due to linear
interpolation of f:

\[ |f(x) - P_1(x)| \le \frac{1}{8} h^2 \max_{x \in [x_0, x_1]} |f''(x)|, \]

where $h = x_1 - x_0$.

10. The interpolation points influence interpolation error through the polynomial
$\prod_{i=0}^{n} (x - x_i)$. Suppose we are interpolating the function f over the interval
[-1, 1] using linear interpolation.

(a) If $x_0 = -1$ and $x_1 = 1$, determine the maximum value of the expression
$|(x - x_0)(x - x_1)|$ for $-1 \le x \le 1$.

(b) If $x_0 = -\sqrt{2}/2$ and $x_1 = \sqrt{2}/2$, determine the maximum value of the
expression $|(x - x_0)(x - x_1)|$ for $-1 \le x \le 1$. How does this compare to the
maximum found in part (a)?

(c) Select any two numbers from the interval [-1, 1] to serve as the interpolation
points $x_0$ and $x_1$. Determine the maximum value of the expression
$|(x - x_0)(x - x_1)|$ for $-1 \le x \le 1$, and compare to the maxima found in parts
(a) and (b).

11. The interpolation points influence interpolation error through the polynomial
$\prod_{i=0}^{n} (x - x_i)$. Suppose we are interpolating the function f over the interval
[-1, 1] using quadratic interpolation.

(a) If $x_0 = -1$, $x_1 = 0$ and $x_2 = 1$, determine the maximum value of the
expression $|(x - x_0)(x - x_1)(x - x_2)|$ for $-1 \le x \le 1$.

(b) If $x_0 = -\sqrt{3}/2$, $x_1 = 0$ and $x_2 = \sqrt{3}/2$, determine the maximum value of
the expression $|(x - x_0)(x - x_1)(x - x_2)|$ for $-1 \le x \le 1$. How does this
compare to the maximum found in part (a)?

(c) Select any three numbers from the interval [-1, 1] to serve as the interpolation
points $x_0$, $x_1$ and $x_2$. Determine the maximum value of the expression
$|(x - x_0)(x - x_1)(x - x_2)|$ for $-1 \le x \le 1$, and compare to the maxima
found in parts (a) and (b).

12. The following data set was taken from a polynomial of degree at most five. Find
the polynomial.

    x    -2    -1     0     1     2     3
    y    39     3    -1    -3    -9    -1

13. Consider the data set

    x    0    1.25    1.85    2.40    3.05    3.64    4.25    4.85    5.45
    y    0    4       6       8       10      12      14      16      18

Determine the polynomial of degree at most eight (8) which interpolates this
data. Over what range of x values would you feel comfortable using the interpolating
polynomial to approximate values of y? Explain.

14. A thermodynamics student needs the temperature of saturated steam under a
pressure of 6.3 megapascals (MPa).

(a) Estimate the temperature using linear interpolation from the data

    Pressure (MPa)    Temperature (°C)
        6.0               275.64
        7.0               285.88

(b) Estimate the temperature using polynomial interpolation from the data

    Pressure (MPa)       4.0       5.0       6.0       7.0       8.0       9.0
    Temperature (°C)     250.40    263.99    275.64    285.88    295.06    303.40

(c) Which approximation do you think is more accurate and why?

15. Perry's Chemical Engineer's Handbook gives the following values for the heat
capacity at constant pressure, $c_p$, of an aqueous solution of methyl alcohol as a
function of the alcohol mole percentage, $\phi$:

    $\phi$ (%)             5.88     12.3     27.3     45.8     69.6     100.0
    $c_p$ (cal/g °C)       0.995    0.98     0.92     0.83     0.726    0.617

All data are provided at T = 40 °C and atmospheric pressure. A table that lists
the heat capacity at constant pressure for $\phi$ = 5, 10, 15, ..., 100% is desired.

16. The table below lists the linewidth of a printed feature on a semiconductor device
as a function of the dissolution time (the amount of time the silicon wafer is placed
in the developer solution).

    Dissolution Time (sec)    10      12      14      16      18      20
    Linewidth (μm)            0.25    0.36    0.45    0.50    0.53    0.5

(a) Approximate the linewidth of the feature after a dissolution time of 15 seconds.

(b) Plot the values in the table, together with the value obtained in part (a).
Does the result from part (a) seem reasonable? Explain.

17. The following table gives the viscosity, in millipascal-seconds (centipoises), of
sulfuric acid as a function of concentration, in mass percent.

    Concentration    0       20      40      60      80      100
    Viscosity        0.89    1.40    2.51    5.37    17.4    24.2

Determine the polynomial of degree at most five which interpolates this data.
The viscosity of sulfuric acid with a 5% concentration is 1.01 and with a 10%
concentration is 1.12. Use these values to assess the accuracy of the interpolating
polynomial.

5.2 NEVILLE'S ALGORITHM


Having established the fundamental concepts behind polynomial interpolation,
two important questions need to be addressed. First, what is the best way to compute
the value of the interpolating polynomial for just one value of the independent
variable? The answer to this question, which is to be studied in this section, gives
rise to what is known as Neville's algorithm. Second, what is the best way to obtain
the interpolating polynomial explicitly, whether this is for evaluation or for
manipulation? This question leads to the development of divided differences and 
the Newton form of the interpolating polynomial, topics which will be considered 
in Section 5.3. 


Notation 


The following notation will greatly simplify the discussion of Neville’s algorithm. 


Let $m_1, m_2, m_3, \ldots, m_k$ be k distinct integers between 0 and n, inclusive. Denote
the unique polynomial of degree at most k - 1 which interpolates data at the points
$x_{m_1}, x_{m_2}, x_{m_3}, \ldots, x_{m_k}$ by $P_{m_1,m_2,m_3,\ldots,m_k}(x)$. The m's are usually listed in
ascending order, though this is not necessary. Figure 5.4 provides an illustration
of this notational scheme. Arrows indicate the points at which data is interpolated
by each polynomial. For example, $P_{1,2,4}(x)$ interpolates the data at the abscissas
$x_1$, $x_2$, and $x_4$.


Figure 5.4 Illustration of notation used for the development of
Neville's algorithm (constant, linear, and quadratic interpolating polynomials).


Constructing Higher-Degree Polynomials from Lower-Degree Polynomials 


Neville’s algorithm is a procedure designed to efficiently determine the value of 
the interpolating polynomial at a single value of the independent variable. In the 
process, an explicit formula for the interpolating polynomial is not generated, just 
the value of the polynomial. The key element in the algorithm is the fact that 
the interpolating polynomial through a given set of n points can be obtained by 
combining two polynomials that interpolate different sets of n - 1 of those points.
For instance, the interpolating polynomial $P_{0,1,2,3}$ can be constructed by combining
the polynomials $P_{0,1,3}$ and $P_{1,2,3}$ in a special way.

To determine this special way of combining lower degree interpolating poly- 
nomials, consider the case of linear interpolation. From the previous section, it is 
known that the polynomial which interpolates data at $x = x_0$ and $x = x_1$, $P_{0,1}$ in
the current notation, is given by

\[ P_{0,1}(x) = \frac{x - x_1}{x_0 - x_1} f_0 + \frac{x - x_0}{x_1 - x_0} f_1. \]


First note that the function values $f_0$ and $f_1$ are the values of two constant
interpolating polynomials. In particular, $f_0 = P_0(x)$ and $f_1 = P_1(x)$. Making these
substitutions into the equation for $P_{0,1}$ and rearranging terms, the linear polynomial
$P_{0,1}$ can be written in the form

\[ P_{0,1}(x) = \frac{(x - x_0) P_1(x) - (x - x_1) P_0(x)}{x_1 - x_0}. \]

    $P_1$: does not interpolate at $x = x_0$; multiplied by $x - x_0$, where $x_0$ is the
           x-coordinate of the point not interpolated by $P_1$.
    $P_0$: does not interpolate at $x = x_1$; multiplied by $x - x_1$, where $x_1$ is the
           x-coordinate of the point not interpolated by $P_0$.
    Denominator: difference of x-coordinates.

Focus on the manner in which the two lower-degree polynomials have been
combined. Note that the polynomial $P_1$ does not interpolate at $x = x_0$ and it is
multiplied by $x - x_0$. Similarly, the polynomial $P_0$ does not interpolate at $x = x_1$
and it is multiplied by $x - x_1$. Hence, each of the lower-degree polynomials is
multiplied by a monomial of the form x minus the x-coordinate of the point not
interpolated by that particular polynomial. These two terms are then subtracted,
and the construction of the higher-degree interpolating polynomial is completed
by dividing by the difference between the abscissas which appear in the coefficient
monomials. The order in which the abscissas appear in the denominator is the
reverse of the order in which they appear in the numerator.

This result with the linear interpolating polynomial suggests that the previously
stated problem of constructing $P_{0,1,2,3}$ from the polynomials $P_{0,1,3}$ and $P_{1,2,3}$
can be carried out as follows. Since $P_{0,1,3}$ does not interpolate at $x = x_2$, multiply
this polynomial by $(x - x_2)$. Further, since $P_{1,2,3}$ does not interpolate at $x = x_0$,
multiply this polynomial by $(x - x_0)$. Deciding to subtract the term involving $P_{1,2,3}$
from the term involving $P_{0,1,3}$ fixes the order of the abscissas in the denominator
as $x_0 - x_2$. Therefore,

\[ P_{0,1,2,3}(x) = \frac{(x - x_2) P_{0,1,3}(x) - (x - x_0) P_{1,2,3}(x)}{x_0 - x_2}. \]

The next theorem establishes that this scheme for combining lower-degree interpolating
polynomials, which was pieced together from an examination of the linear
interpolant, does, in fact, hold in general.


Theorem. Let $x_0, x_1, x_2, \ldots, x_n$ be n + 1 distinct abscissas; let $f_0, f_1, f_2,
\ldots, f_n$ denote the corresponding function values being interpolated; and let
$m_1, m_2, m_3, \ldots, m_k$ be k distinct integers between 0 and n, inclusive. Then,
for k = 1, $P_{m_1}(x) = f_{m_1}$, and for k > 1,

\[ P_{m_1,m_2,\ldots,m_k}(x) = \frac{(x - x_{m_1}) P_{m_2,m_3,\ldots,m_k}(x) - (x - x_{m_k}) P_{m_1,m_2,\ldots,m_{k-1}}(x)}{x_{m_k} - x_{m_1}}. \]

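Proof. For $k = 1$, $P_{m_1}(x) = f_{m_1}$ follows from the definition of
$P_{m_1}(x)$ as the unique polynomial of degree 0 that interpolates at
$x = x_{m_1}$. For $k > 1$, consider the polynomial

$$P(x) = \frac{(x - x_{m_1})\,P_{m_2,m_3,\ldots,m_k}(x) - (x - x_{m_k})\,P_{m_1,m_2,\ldots,m_{k-1}}(x)}{x_{m_k} - x_{m_1}}.$$

First note that $P_{m_2,m_3,\ldots,m_k}$ and $P_{m_1,m_2,\ldots,m_{k-1}}$ are
polynomials of degree at most $k-2$ (each interpolates data at $k-1$ points), so
$P$ is a polynomial of degree at most $k-1$. Since $P_{m_2,m_3,\ldots,m_k}$ and
$P_{m_1,m_2,\ldots,m_{k-1}}$ both interpolate data at the set of abscissas
$x_{m_2}, x_{m_3}, \ldots, x_{m_{k-1}}$, it follows that for each
$i = 2, 3, 4, \ldots, k-1$

$$\begin{aligned}
P(x_{m_i}) &= \frac{(x_{m_i} - x_{m_1})\,P_{m_2,\ldots,m_k}(x_{m_i}) - (x_{m_i} - x_{m_k})\,P_{m_1,\ldots,m_{k-1}}(x_{m_i})}{x_{m_k} - x_{m_1}} \\
&= \frac{(x_{m_i} - x_{m_1})\,f_{m_i} - (x_{m_i} - x_{m_k})\,f_{m_i}}{x_{m_k} - x_{m_1}} \\
&= f_{m_i}.
\end{aligned}$$

Furthermore,

$$\begin{aligned}
P(x_{m_1}) &= \frac{(x_{m_1} - x_{m_1})\,P_{m_2,\ldots,m_k}(x_{m_1}) - (x_{m_1} - x_{m_k})\,P_{m_1,\ldots,m_{k-1}}(x_{m_1})}{x_{m_k} - x_{m_1}} \\
&= \frac{(0)\,P_{m_2,\ldots,m_k}(x_{m_1}) - (x_{m_1} - x_{m_k})\,f_{m_1}}{x_{m_k} - x_{m_1}} \\
&= f_{m_1}
\end{aligned}$$

and

$$\begin{aligned}
P(x_{m_k}) &= \frac{(x_{m_k} - x_{m_1})\,P_{m_2,\ldots,m_k}(x_{m_k}) - (x_{m_k} - x_{m_k})\,P_{m_1,\ldots,m_{k-1}}(x_{m_k})}{x_{m_k} - x_{m_1}} \\
&= \frac{(x_{m_k} - x_{m_1})\,f_{m_k} - (0)\,P_{m_1,\ldots,m_{k-1}}(x_{m_k})}{x_{m_k} - x_{m_1}} \\
&= f_{m_k}.
\end{aligned}$$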


356 Chapter 5 Interpolation (and Curve Fitting) 


Hence, $P(x_{m_i}) = f_{m_i}$ for each $i = 1, 2, 3, \ldots, k$. But
$P_{m_1,m_2,\ldots,m_k}$ is the unique polynomial of degree at most $k-1$ which
interpolates data at the set of abscissas $x_{m_1}, x_{m_2}, x_{m_3}, \ldots, x_{m_k}$.
Therefore,

$$P_{m_1,m_2,\ldots,m_k}(x) = \frac{(x - x_{m_1})\,P_{m_2,m_3,\ldots,m_k}(x) - (x - x_{m_k})\,P_{m_1,m_2,\ldots,m_{k-1}}(x)}{x_{m_k} - x_{m_1}}. \qquad \square$$

The Algorithm 
The problem at hand is that of evaluating the interpolating polynomial at a single
point. Problem data consist of $n+1$ distinct abscissas $x_0, x_1, x_2, \ldots, x_n$,
a corresponding set of function values $f_0, f_1, f_2, \ldots, f_n$, and one value of
the independent variable, $\bar{x}$, at which the interpolating polynomial is to be
evaluated. Recall that the function values can be interpreted as a collection of
constant (zeroth-degree) interpolating polynomials. Starting from these values, the
previous theorem, with $x$ replaced by $\bar{x}$, can be used to compute the value of
a set of linear polynomials, from which the value of a set of quadratic polynomials
can be computed, and so on until the value of the polynomial with highest possible
degree has been determined. All calculations for Neville's algorithm can be
conveniently organized into a table:


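$$\begin{array}{lllll}
x_0 & f_0 = P_0(\bar{x}) \\
x_1 & f_1 = P_1(\bar{x}) & P_{0,1}(\bar{x}) \\
x_2 & f_2 = P_2(\bar{x}) & P_{1,2}(\bar{x}) & P_{0,1,2}(\bar{x}) \\
x_3 & f_3 = P_3(\bar{x}) & P_{2,3}(\bar{x}) & P_{1,2,3}(\bar{x}) & P_{0,1,2,3}(\bar{x})
\end{array}$$

where, for example,

$$P_{2,3}(\bar{x}) = \frac{(\bar{x} - x_2)\,P_3(\bar{x}) - (\bar{x} - x_3)\,P_2(\bar{x})}{x_3 - x_2}.$$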


Here is an example to demonstrate the construction of this Neville’s algorithm 
table. 


EXAMPLE 5.5 An Arbitrary Set of Data 


Based on the following data, approximate the value of y when x = 1.5.

x    -1    0    1    2
y     5    1    1   11

The Neville's algorithm table based on this data is given below. From this table,
we see that when x = 1.5, the value of the interpolating polynomial that uses all
four data points is 4.375.

$$\begin{array}{lllll}
x_0 = -1 & P_0(1.5) = 5 \\
x_1 = 0 & P_1(1.5) = 1 & P_{0,1}(1.5) = -5 \\
x_2 = 1 & P_2(1.5) = 1 & P_{1,2}(1.5) = 1 & P_{0,1,2}(1.5) = 2.5 \\
x_3 = 2 & P_3(1.5) = 11 & P_{2,3}(1.5) = 6 & P_{1,2,3}(1.5) = 4.75 & P_{0,1,2,3}(1.5) = 4.375
\end{array}$$

So where did all of these numbers come from? The first two columns are just the 
given abscissas and function values. The values in the remaining three columns 


were computed as follows. Each number gives the value of a different interpolating 
polynomial when x = 1.5. 


$$\begin{aligned}
P_{0,1}(1.5) &= \frac{(1.5 - x_0)\,P_1(1.5) - (1.5 - x_1)\,P_0(1.5)}{x_1 - x_0}
 = \frac{2.5 \times 1 - 1.5 \times 5}{1} = -5 \\[4pt]
P_{1,2}(1.5) &= \frac{(1.5 - x_1)\,P_2(1.5) - (1.5 - x_2)\,P_1(1.5)}{x_2 - x_1}
 = \frac{1.5 \times 1 - 0.5 \times 1}{1} = 1 \\[4pt]
P_{2,3}(1.5) &= \frac{(1.5 - x_2)\,P_3(1.5) - (1.5 - x_3)\,P_2(1.5)}{x_3 - x_2}
 = \frac{0.5 \times 11 - (-0.5) \times 1}{1} = 6 \\[4pt]
P_{0,1,2}(1.5) &= \frac{(1.5 - x_0)\,P_{1,2}(1.5) - (1.5 - x_2)\,P_{0,1}(1.5)}{x_2 - x_0}
 = \frac{2.5 \times 1 - 0.5 \times (-5)}{2} = 2.5 \\[4pt]
P_{1,2,3}(1.5) &= \frac{(1.5 - x_1)\,P_{2,3}(1.5) - (1.5 - x_3)\,P_{1,2}(1.5)}{x_3 - x_1}
 = \frac{1.5 \times 6 - (-0.5) \times 1}{2} = 4.75 \\[4pt]
P_{0,1,2,3}(1.5) &= \frac{(1.5 - x_0)\,P_{1,2,3}(1.5) - (1.5 - x_3)\,P_{0,1,2}(1.5)}{x_3 - x_0}
 = \frac{2.5 \times 4.75 - (-0.5) \times 2.5}{3} = 4.375
\end{aligned}$$

To assess the reasonableness of this approximation, Figure 5.5 plots the given data
points (represented as circles), the interpolated point (the asterisk), and the
interpolating polynomial. The behavior of the polynomial is plausible, giving
confidence that the approximation is reasonably accurate.

Figure 5.5 Data points (circles), interpolated point (asterisk), and
interpolating polynomial for the data in the "An Arbitrary Set of Data"
Example.
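
The column-by-column construction just carried out by hand can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `neville` and the choice to overwrite a single working column in place are assumptions made here, not part of the text.

```python
def neville(xs, fs, xbar):
    """Value at xbar of the polynomial interpolating the points (xs[i], fs[i])."""
    n = len(xs)
    # Column 0 of the table: the constant polynomials P_i(xbar) = f_i.
    col = list(fs)
    for j in range(1, n):
        # Sweep upward from the bottom so that col[i - 1] still holds the
        # entry from the previous column when col[i] is updated.
        for i in range(n - 1, j - 1, -1):
            col[i] = ((xbar - xs[i - j]) * col[i]
                      - (xbar - xs[i]) * col[i - 1]) / (xs[i] - xs[i - j])
    # The last entry is the value of the polynomial through all the points.
    return col[-1]


print(neville([-1, 0, 1, 2], [5, 1, 1, 11], 1.5))
```

For the data of this example the call reproduces the final table entry, 4.375.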


Although the algorithm just described constructs the Neville’s table one col- 
umn at a time, the table can also be constructed one row at a time. One advantage 
of working in a row-wise manner is that if additional data becomes available, then 
there is no need to start from scratch to compute the value of the interpolating 
polynomial which incorporates the new data as well as the old. As long as the last 
row from the Neville’s table has been saved, all that is needed is to compute one 
new row for each of the new data points. 
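
A minimal sketch of the row-wise construction follows, assuming the only state kept between updates is the list of abscissas used so far and the most recent row of the table. The helper name `neville_add_row` is hypothetical.

```python
def neville_add_row(xs, last_row, x_new, f_new, xbar):
    """Return the next row of Neville's table after appending (x_new, f_new).

    xs       -- abscissas already in the table, x_0, ..., x_{i-1}
    last_row -- the saved row for x_{i-1}, all entries evaluated at xbar
    """
    i = len(xs)
    row = [f_new]  # constant polynomial through the new point
    for j in range(1, i + 1):
        # Combine the entry to the left (same row) with the entry
        # diagonally up and to the left (previous row).
        row.append(((xbar - xs[i - j]) * row[j - 1]
                    - (xbar - x_new) * last_row[j - 1]) / (x_new - xs[i - j]))
    return row


xs, row = [], []
for x, f in [(-1, 5), (0, 1), (1, 1), (2, 11)]:
    row = neville_add_row(xs, row, x, f, 1.5)
    xs.append(x)
print(row)  # last row of the table; its final entry is the interpolated value
```

Each new data point costs one pass along a single row, and nothing computed earlier is revisited.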


Some More Notation 


An alternative notation, one that more closely reflects the matrix-like structure of
the Neville's table, is typically introduced. For $i \ge j$, let
$Q_{i,j} = P_{i-j,\,i-j+1,\,i-j+2,\,\ldots,\,i}$. Note that the first subscript on
$Q_{i,j}$ indicates the index of the last point interpolated, the second subscript
denotes the degree of the polynomial, and $Q_{i,j}$ interpolates data from
$x = x_{i-j}$ through $x = x_i$. In terms of this new notation, the constant
(zeroth-degree) interpolating polynomials, $Q_{i,0}$, are given by the function
values $f_i$ ($i = 0, 1, 2, \ldots, n$); the formula for constructing higher-degree
polynomials from lower-degree polynomials takes the form

$$Q_{i,j}(x) = \frac{(x - x_{i-j})\,Q_{i,j-1}(x) - (x - x_i)\,Q_{i-1,j-1}(x)}{x_i - x_{i-j}},$$

and the Neville's table appears in Figure 5.6.

$$\begin{array}{lllll}
x_0 & f_0 = Q_{0,0}(\bar{x}) \\
x_1 & f_1 = Q_{1,0}(\bar{x}) & Q_{1,1}(\bar{x}) \\
x_2 & f_2 = Q_{2,0}(\bar{x}) & Q_{2,1}(\bar{x}) & Q_{2,2}(\bar{x}) \\
x_3 & f_3 = Q_{3,0}(\bar{x}) & Q_{3,1}(\bar{x}) & Q_{3,2}(\bar{x}) & Q_{3,3}(\bar{x})
\end{array}$$

where, for example,

$$Q_{3,1}(\bar{x}) = \frac{(\bar{x} - x_2)\,Q_{3,0}(\bar{x}) - (\bar{x} - x_3)\,Q_{2,0}(\bar{x})}{x_3 - x_2}.$$

Figure 5.6 Organization of a Neville's Table using $Q_{i,j}$-notation.


EXAMPLE 5.6 Relative Viscosity of Ethanol as a Function of Anhydrous 
Solute Weight 


The table below lists the relative viscosity, V, of ethanol as a function of the percent 
of anhydrous solute weight, w. 


w     10      20      40      60      80      100
V     1.498   2.138   2.840   2.542   1.877   1.201


An estimate of the relative viscosity when w = 50 is needed. The Neville’s table 
generated from this data is provided below. All values have been rounded to four 
decimal places. 


 10    1.498
 20    2.138    4.058
 40    2.840    3.191     2.902
 60    2.542    2.691     2.816     2.8332
 80    1.877    2.8745    2.7369    2.7764    2.8008
100    1.201    2.891     2.8704    2.7591    2.7699    2.7871


The relative viscosity of ethanol with a 50% anhydrous solute weight is approxi- 
mately 2.7871. Figure 5.7 is used to assess the accuracy of this result. Based on this 
graph, the approximate value for the relative viscosity seems perfectly reasonable. 
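
The full table for this example can be regenerated with a short Python sketch that stores the entries as `Q[i][j]`, mirroring the $Q_{i,j}$ notation. The function name `neville_table` and the list-of-lists layout are illustrative assumptions.

```python
def neville_table(xs, fs, xbar):
    """Lower-triangular Neville table: Q[i][j] is Q_{i,j} evaluated at xbar."""
    n = len(xs)
    Q = [[None] * (i + 1) for i in range(n)]
    for i in range(n):
        Q[i][0] = fs[i]  # zeroth-degree polynomials are the data values
        for j in range(1, i + 1):
            Q[i][j] = ((xbar - xs[i - j]) * Q[i][j - 1]
                       - (xbar - xs[i]) * Q[i - 1][j - 1]) / (xs[i] - xs[i - j])
    return Q


w = [10, 20, 40, 60, 80, 100]
V = [1.498, 2.138, 2.840, 2.542, 1.877, 1.201]
Q = neville_table(w, V, 50)
for wi, row in zip(w, Q):
    print(wi, [round(q, 4) for q in row])
```

The last entry of the last row, `Q[5][5]`, is the estimate of the relative viscosity at w = 50.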


Figure 5.7 Data points (circles), interpolated point (asterisk), and
interpolating polynomial for the relative viscosity of ethanol as a function
of percent of anhydrous solute weight.


EXAMPLE 5.7 Probability of a Shutout in Racquetball


The following table (drawn from Joseph Keller, "Probability of a Shutout in
Racquetball," SIAM Review, 26, 267-268, 1984) gives the probability, P, that a given
player will shut out an opponent as a function of the probability, p, that the player
will win any particular rally regardless of who serves.


p        P
1.0      1.0
0.9      0.753
0.85     0.534
0.842    0.500
0.84     0.490
0.5      0.0001504


Suppose that a player estimates a 60% chance of winning any particular rally against
a given opponent, regardless of who serves. What is the probability that this player
will shut out that opponent?

The Neville's table generated by this data, with entries rounded to four decimal
places, is given in Table 5.1. The final value, -17.9428, is clearly inappropriate.
Valid probability scores must be between zero and one, inclusive. Figure 5.8 offers
some insight into what has gone wrong: the interpolating polynomial provides a
poor representation for the data. Over roughly half of the problem domain, the
interpolating polynomial predicts invalid probabilities: negative values on the left
portion of the domain, values larger than one on the right.


0.5      0.0001504
0.84     0.490       0.1442
0.842    0.500      -0.71      -0.1055
0.85     0.534      -0.5285    -5.066     -1.5228
0.9      0.753      -0.561     -0.3929    -23.7584    -7.0817
1.0      1.0         0.0120    -1.516      1.3273     -61.3870    -17.9428

TABLE 5.1: Neville's table for Example 5.7.


Figure 5.8 Data points (circles), interpolated point (asterisk), and
interpolating polynomial for the probability of a shutout in racquetball as
a function of the probability of winning any particular rally. (Horizontal axis:
probability of winning any rally, p; vertical axis: probability of a shutout, P.)
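
As a cautionary sketch, the same calculation can be wrapped in a simple range check so that an invalid "probability" is flagged rather than silently accepted. The function `neville` below is an illustrative implementation and the check itself is an assumption about how one might guard such a result, not something prescribed by the text.

```python
def neville(xs, fs, xbar):
    """Value at xbar of the polynomial interpolating the points (xs[i], fs[i])."""
    col = list(fs)
    n = len(xs)
    for j in range(1, n):
        for i in range(n - 1, j - 1, -1):
            col[i] = ((xbar - xs[i - j]) * col[i]
                      - (xbar - xs[i]) * col[i - 1]) / (xs[i] - xs[i - j])
    return col[-1]


# Data of the racquetball example, listed in ascending order of p.
p = [0.5, 0.84, 0.842, 0.85, 0.9, 1.0]
P = [0.0001504, 0.490, 0.500, 0.534, 0.753, 1.0]

estimate = neville(p, P, 0.6)
is_valid_probability = 0.0 <= estimate <= 1.0
print(round(estimate, 4), is_valid_probability)
```

The check fails for this data, which is the numerical counterpart of what the plot of the interpolating polynomial shows.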


EXERCISES 


1. Indicate how to construct each of the following interpolating polynomials.
(a) $P_{0,1,2,3}(x)$ from $P_{0,1,2}(x)$ and $P_{1,2,3}(x)$
(b) $P_{0,1,2,3}(x)$ from $P_{0,2,3}(x)$ and $P_{0,1,3}(x)$
(c) $P_{0,1,2,3}(x)$ from $P_{1,2,3}(x)$ and $P_{0,2,3}(x)$
(d) $P_{0,1,2,3}(x)$ from $P_{0,1,3}(x)$ and $P_{0,1,2}(x)$

2. Indicate how to construct each of the following interpolating polynomials.
(a) $P_{0,1,2}(x)$ from $P_{1,2}(x)$ and $P_{0,2}(x)$
(b) $P_{1,3,4,6}(x)$ from $P_{1,4,6}(x)$ and $P_{1,3,6}(x)$
(c) $P_{0,2,3,4,7}(x)$ from $P_{0,2,4,7}(x)$ and $P_{2,3,4,7}(x)$
(d) $P_{1,2,3,4,5,6}(x)$ from $P_{1,2,3,5,6}(x)$ and $P_{1,3,4,5,6}(x)$

3. Construct the Neville's table for the following data set. Take $\bar{x} = 3.7$.

x     2    4    5
y    -1    4    8

4. Construct the Neville's table for the following data set. Take $\bar{x} = 1.3$.
%. Or de <2 
a ae 


5. Construct the Neville's table for the following data set. Take $\bar{x} = -0.5$.

x    -1    0    1    2
y     3   -1   -3    1

6. Construct the Neville's table for the following data set. Take $\bar{x} = -3$.

x    -7   -5   -4   -1
y    10    5    2   10

7. Given $x_0 = 0$, $x_1 = 1$, $x_2 = 2$, $P_{0,1}(x) = 2x + 2$ and
$P_{0,2}(x) = 3x + 2$, what is $P_{0,1,2}(x)$?

8. Given $x_0 = -1$, $x_1 = 0$, $x_2 = 1$, $x_3 = 2$, $P_{0,2}(x) = -3x$,
$P_{2,3}(x) = 4x - 7$ and $P_{1,2,3}(1.7) = -0.83$, calculate $P_{0,2,3}(x)$ and
$P_{0,1,2,3}(1.7)$.


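9. Determine the missing values in the Neville's table provided below.

$$\begin{array}{llll}
x_0 = 0 & P_0(1.3) = -1 \\
x_1 = 1 & P_1(1.3) = \,? & P_{0,1}(1.3) = 5.5 \\
x_2 = 2 & P_2(1.3) = \,? & P_{1,2}(1.3) = \,? & P_{0,1,2}(1.3) = 4.915
\end{array}$$

10. Determine the missing values in the Neville's table provided below. For some of
the values you will need to work backwards.

$$\begin{array}{lllll}
x_0 = 0 & P_0(2.5) = 1 \\
x_1 = 1 & P_1(2.5) = 3 & P_{0,1}(2.5) = 6 \\
x_2 = 2 & P_2(2.5) = 3 & P_{1,2}(2.5) = \,? & P_{0,1,2}(2.5) = \,? \\
x_3 = 3 & P_3(2.5) = \,? & P_{2,3}(2.5) = 3 & P_{1,2,3}(2.5) = 3 & P_{0,1,2,3}(2.5) = \,?
\end{array}$$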


11. Use Neville's algorithm to evaluate the interpolating polynomial for
$f(x) = \ln x$ that passes through the points $(1, \ln 1)$, $(2, \ln 2)$, and
$(3, \ln 3)$ at $x = 1.5$.

12. Use Neville's algorithm to evaluate the interpolating polynomial for
$f(x) = \sin x$ that passes through the points $(0, \sin 0)$,
$(\pi/4, \sin \pi/4)$, and $(\pi/2, \sin \pi/2)$ at $x = \pi/6$.

13. Use Neville's algorithm to evaluate the interpolating polynomial for
$f(x) = e^x$ that passes through the points $(-1, e^{-1})$, $(0, e^0)$, and
$(1, e^1)$ at $x = 0.5$.


For Exercises 14-17, use Neville's algorithm to estimate the requested value(s). Assess
the accuracy of each estimate by plotting the data points and the estimated point(s)
on the same set of coordinate axes.


14. The mean activity coefficient at 25°C for silver nitrate, as a function of molality,
is given in the table below. Estimate the mean activity coefficient for a molality
of 0.032 and for a molality of 1.682.


Molality     0.005  0.010  0.020  0.050  0.100  0.200  0.500  1.000  2.000
Coefficient  0.924  0.896  0.859  0.794  0.732  0.656  0.536  0.480  0.316


15. The values listed in the table provide the surface tension of mercury as a function
of temperature. Estimate the surface tension of mercury at 20°C and at 60°C.


Temperature (°C)           10      25      50      75      100
Surface Tension (dyn/cm)   488.55  485.48  480.36  475.23  470.11


16. The thermal conductivity of air as a function of temperature is given in the table
below. Estimate the thermal conductivity of air when T = 240 K and when
T = 485 K.


Temperature (K)                  100   200   300   400   500   600
Thermal Conductivity (mW/m·K)    9.4   18.4  26.2  33.3  39.7  45.7


17. Estimate the viscosity of sulfuric acid with a concentration (in mass percent) of
7.5% and a concentration of 92% given the following values.


Concentration (mass %)    0     5     10    20    40    60    80    100
Viscosity (centipoise)    0.89  1.01  1.12  1.40  2.51  5.37  17.4  24.2


5.3. THE NEWTON FORM OF THE INTERPOLATING POLYNOMIAL 


At the beginning of Section 5.2, two important questions were raised regarding 
polynomial interpolation. The first considered the best way to obtain the value of 
the interpolating polynomial at a single value of the independent variable. Neville’s 
algorithm provided the answer to this question. The algorithm did not actually 
construct the interpolating polynomial but rather generated the polynomial value 
in an incremental fashion. The second question considered the best way to proceed 
when the value of the interpolating polynomial was needed at several values of the 
independent variable. In this case, Neville’s algorithm is no longer efficient. It is, 
instead, better to obtain the interpolating polynomial explicitly in a form efficient 
for evaluation. 


Newton Form of the Interpolating Polynomial


The objective of this section is to write $P_{0,1,2,\ldots,n}(x)$, the unique polynomial
of degree at most $n$ which interpolates data at the $n+1$ distinct abscissas
$x_0, x_1, x_2, \ldots, x_n$, in Newton Form:

$$P_{0,1,2,\ldots,n}(x) = a_0 + a_1(x - x_0) + a_2(x - x_0)(x - x_1) + \cdots
 + a_n(x - x_0)(x - x_1)\cdots(x - x_{n-1}).$$

Like the more standard representation of an nth-degree polynomial

$$a_0 + a_1 x + a_2 x^2 + \cdots + a_n x^n,$$

the Newton form can be written as a "nested iteration,"

$$P_{0,1,2,\ldots,n}(\bar{x}) = a_0 + (\bar{x} - x_0)\bigl(a_1 + (\bar{x} - x_1)\bigl(a_2 + (\bar{x} - x_2)\bigl(\cdots
 (\bar{x} - x_{n-2})\bigl(a_{n-1} + a_n(\bar{x} - x_{n-1})\bigr)\cdots\bigr)\bigr)\bigr),$$

and evaluated by a modified version of the synthetic division algorithm:

    value := a_n
    for i from n - 1 downto 0 do
        value := a_i + (x̄ - x_i) value
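
The loop above translates almost directly into Python. In this sketch the function name `newton_eval` and the small test polynomial are illustrative assumptions; the coefficients are taken as given, since their computation is a separate matter.

```python
def newton_eval(coeffs, xs, xbar):
    """Evaluate a_0 + a_1(x - x_0) + ... + a_n(x - x_0)...(x - x_{n-1}) at xbar.

    coeffs -- [a_0, a_1, ..., a_n]
    xs     -- abscissas; only x_0, ..., x_{n-1} are used
    """
    value = coeffs[-1]
    for i in range(len(coeffs) - 2, -1, -1):
        value = coeffs[i] + (xbar - xs[i]) * value
    return value


# 1 + 2(x - 0) + 3(x - 0)(x - 1), evaluated at x = 2: 1 + 4 + 6
print(newton_eval([1, 2, 3], [0, 1], 2))
```

Evaluation costs n multiplications and 2n additions or subtractions, the same order of work as nested evaluation of the standard form.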


The primary advantage of the Newton form over the standard representation is in
the computation of the polynomial coefficients $a_0, a_1, a_2, \ldots, a_n$. Determining
the coefficients for the standard representation requires the solution of a system of
$n+1$ linear equations. The number of operations needed to solve this system is
$O(n^3)$; that is, roughly a constant times $n^3$. The algorithm to be developed
presently for determining the coefficients of the Newton form requires only $O(n^2)$
operations. For even moderate values of $n$, this represents a significant
computational savings.


Divided Differences 


To determine the coefficients of the Newton form of the interpolating polynomial,
each of the interpolating conditions

$$P_{0,1,2,\ldots,n}(x_k) = f(x_k)$$

($k = 0, 1, 2, \ldots, n$) must be applied. The special structure of the Newton form
leads to a system of equations of the form

$$\begin{aligned}
a_0 &= f(x_0) \\
a_0 + a_1(x_1 - x_0) &= f(x_1) \\
a_0 + a_1(x_2 - x_0) + a_2(x_2 - x_0)(x_2 - x_1) &= f(x_2) \\
&\;\;\vdots \\
a_0 + a_1(x_n - x_0) + a_2(x_n - x_0)(x_n - x_1) + \cdots + a_n(x_n - x_0)\cdots(x_n - x_{n-1}) &= f(x_n)
\end{aligned} \qquad (1)$$

The solution to this system of equations can be obtained rather easily by forward
substitution (solve the first equation for the single unknown $a_0$, substitute that
value into the second equation and solve for the only remaining unknown $a_1$,
substitute the values of $a_0$ and $a_1$ into the third equation and solve for the only
remaining unknown $a_2$, etc.) and can be expressed compactly in terms of divided
differences.


Definition. Let $f$ be a function defined at the distinct points
$x_0, x_1, x_2, \ldots, x_n$.

The ZEROTH DIVIDED DIFFERENCE of $f$ with respect to the point $x_i$ is
$f[x_i] = f(x_i)$.

For $0 < k \le n$, the kTH DIVIDED DIFFERENCE of $f$ with respect to the
points $x_i, x_{i+1}, x_{i+2}, \ldots, x_{i+k}$ is

$$f[x_i, x_{i+1}, x_{i+2}, \ldots, x_{i+k}] =
 \frac{f[x_{i+1}, x_{i+2}, \ldots, x_{i+k}] - f[x_i, x_{i+1}, \ldots, x_{i+k-1}]}{x_{i+k} - x_i}.$$

Note the way that each kth divided difference with respect to a set of $k+1$
points is constructed from two $(k-1)$-st divided differences with respect to different
subsets of $k$ points. This is very similar to the procedure used in Neville's algorithm
to construct higher-degree polynomials from lower-degree polynomials.

Now, on to the solution for the coefficients in the Newton form of the interpolating
polynomial. The first equation in (1) gives $a_0 = f(x_0) = f[x_0]$, the zeroth
divided difference with respect to the first interpolating point $x = x_0$. Substituting
this value into the second equation and solving for $a_1$ gives

$$a_1 = \frac{f(x_1) - f(x_0)}{x_1 - x_0} = \frac{f[x_1] - f[x_0]}{x_1 - x_0} = f[x_0, x_1],$$

the first divided difference with respect to the first two interpolating points.
Substituting the values for $a_0$ and $a_1$ into the third equation and solving for $a_2$
produces, after some tedious algebraic manipulation,

$$a_2 = \frac{f[x_1, x_2] - f[x_0, x_1]}{x_2 - x_0} = f[x_0, x_1, x_2].$$

Continuing in this fashion, it is found that

$$a_k = f[x_0, x_1, x_2, \ldots, x_k];$$

and, hence, the Newton form of the interpolating polynomial is

$$P_{0,1,2,\ldots,n}(x) = \sum_{k=0}^{n} \left( f[x_0, x_1, x_2, \ldots, x_k] \prod_{i=0}^{k-1} (x - x_i) \right).$$

The strong similarity between the formula in the definition of the kth divided
difference and the formula used to construct the Neville's algorithm table suggests
organizing the calculation of divided differences into a table. Such a divided
difference table, based on four interpolating points, is shown in Figure 5.9. Note that
the coefficients of the Newton form of the interpolating polynomial are the values
at the top of each column.

With $n+1$ interpolating points, $n$ columns need to be computed to form the
complete divided difference table. For $i = 1, 2, 3, \ldots, n$, the ith column of the
table contains $n+1-i$ values that need to be computed, each of which requires
two subtractions and one division. The number of operations to compute the entire
table is therefore

$$\sum_{i=1}^{n} 3(n+1-i) = \frac{3n(n+1)}{2},$$

or $O(n^2)$.

$$\begin{array}{lllll}
 & \text{Zeroth} & \text{First} & \text{Second} & \text{Third} \\
x_0 & f[x_0] \\
 & & f[x_0, x_1] \\
x_1 & f[x_1] & & f[x_0, x_1, x_2] \\
 & & f[x_1, x_2] & & f[x_0, x_1, x_2, x_3] \\
x_2 & f[x_2] & & f[x_1, x_2, x_3] \\
 & & f[x_2, x_3] \\
x_3 & f[x_3]
\end{array}$$

where, for example,

$$f[x_1, x_2, x_3] = \frac{f[x_2, x_3] - f[x_1, x_2]}{x_3 - x_1}.$$

Figure 5.9 Organization of a divided difference table.


EXAMPLE 5.8 An Arbitrary Set of Data 


Determine the Newton form of the interpolating polynomial for the following data
set. Then use this polynomial to estimate the value of y when x = 1.5.


x    -1    0    1    2
y     5    1    1   11


Using the data in the order given produces the following divided difference table.

$$\begin{array}{lllll}
x_0 = -1 & f[x_0] = 5 \\
 & & f[x_0, x_1] = -4 \\
x_1 = 0 & f[x_1] = 1 & & f[x_0, x_1, x_2] = 2 \\
 & & f[x_1, x_2] = 0 & & f[x_0, x_1, x_2, x_3] = 1 \\
x_2 = 1 & f[x_2] = 1 & & f[x_1, x_2, x_3] = 5 \\
 & & f[x_2, x_3] = 10 \\
x_3 = 2 & f[x_3] = 11
\end{array}$$

The Newton form of the interpolating polynomial is then

$$\begin{aligned}
P_{0,1,2,3}(x) &= f[x_0] + f[x_0, x_1](x - x_0) + f[x_0, x_1, x_2](x - x_0)(x - x_1) \\
&\quad + f[x_0, x_1, x_2, x_3](x - x_0)(x - x_1)(x - x_2) \\
&= 5 - 4(x + 1) + 2(x + 1)x + (x + 1)x(x - 1).
\end{aligned}$$

Evaluating this polynomial at x = 1.5 produces the value

$$\begin{aligned}
P_{0,1,2,3}(1.5) &= 5 - 4(1.5 + 1) + 2(1.5 + 1)(1.5) + (1.5 + 1)(1.5)(1.5 - 1) \\
&= 5 - 10 + 7.5 + 1.875 = 4.375,
\end{aligned}$$

precisely the same value as produced by Neville's algorithm in the previous section.

For completeness, each of the values in the divided difference table was computed
as follows. For the first divided differences,

$$\begin{aligned}
f[x_0, x_1] &= \frac{f[x_1] - f[x_0]}{x_1 - x_0} = \frac{1 - 5}{0 - (-1)} = -4, \\
f[x_1, x_2] &= \frac{f[x_2] - f[x_1]}{x_2 - x_1} = \frac{1 - 1}{1 - 0} = 0, \\
f[x_2, x_3] &= \frac{f[x_3] - f[x_2]}{x_3 - x_2} = \frac{11 - 1}{2 - 1} = 10.
\end{aligned}$$

For the second divided differences,

$$\begin{aligned}
f[x_0, x_1, x_2] &= \frac{f[x_1, x_2] - f[x_0, x_1]}{x_2 - x_0} = \frac{0 - (-4)}{1 - (-1)} = 2, \\
f[x_1, x_2, x_3] &= \frac{f[x_2, x_3] - f[x_1, x_2]}{x_3 - x_1} = \frac{10 - 0}{2 - 0} = 5.
\end{aligned}$$

Finally, for the third divided difference,

$$f[x_0, x_1, x_2, x_3] = \frac{f[x_1, x_2, x_3] - f[x_0, x_1, x_2]}{x_3 - x_0} = \frac{5 - 2}{2 - (-1)} = 1.$$
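
The hand calculation above can be checked with a short Python sketch. The name `divided_differences` and the in-place, single-array layout are assumptions made for illustration; after column k is processed, entry i of the array holds the divided difference over the points x_{i-k} through x_i, so the top-of-column values accumulate at the front of the array.

```python
def divided_differences(xs, fs):
    """Return [f[x_0], f[x_0,x_1], ..., f[x_0,...,x_n]], the Newton coefficients."""
    n = len(xs)
    coef = list(fs)
    for k in range(1, n):
        # Work from the bottom up so coef[i - 1] is still from column k - 1.
        for i in range(n - 1, k - 1, -1):
            coef[i] = (coef[i] - coef[i - 1]) / (xs[i] - xs[i - k])
    return coef


print(divided_differences([-1, 0, 1, 2], [5, 1, 1, 11]))
```

For the data of this example the result is the list of values found at the top of each column of the table: 5, -4, 2, and 1.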


EXAMPLE 5.9 Relative Viscosity of Ethanol as a Function of Anhydrous 
Solute Weight 


The following table lists the relative viscosity, V, of ethanol as a function of the
percent of anhydrous solute weight, w.


w     10      20      40      60      80      100
V     1.498   2.138   2.840   2.542   1.877   1.201


A table of relative viscosity values, starting at 5% solute weight and proceeding to
100% solute weight in increments of 5%, is needed. Using the data in the order
provided, the divided difference table, with values rounded to five significant
digits, is


 10   1.498
               0.064
 20   2.138              -9.6333e-4
               0.0351                  -5.7334e-6
 40   2.840              -1.25e-3                    2.7031e-7
              -0.0149                   1.3188e-5                 -3.8050e-9
 60   2.542              -4.5875e-4                 -7.2141e-8
              -0.03325                  7.4167e-6
 80   1.877              -1.375e-5
              -0.0338
100   1.201

The Newton form of the interpolating polynomial relating solute weight and relative
viscosity is then

$$\begin{aligned}
V = 1.498 &+ 0.064(w - 10) - 0.00096333(w - 10)(w - 20) \\
&- 0.0000057334(w - 10)(w - 20)(w - 40) \\
&+ 0.00000027031(w - 10)(w - 20)(w - 40)(w - 60) \\
&- 0.0000000038050(w - 10)(w - 20)(w - 40)(w - 60)(w - 80).
\end{aligned}$$

Evaluating this polynomial at w = 5%, 10%, 15%, ..., 100% produces the desired
table:


  w     V          w     V

  5    1.201      55    2.682
 10    1.498      60    2.542
 15    1.824      65    2.380
 20    2.138      70    2.210
 25    2.411      75    2.040
 30    2.624      80    1.877
 35    2.768      85    1.722
 40    2.840      90    1.569
 45    2.844      95    1.403
 50    2.787     100    1.201


Assessing the accuracy of the values in this table will be left as an exercise. 
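
The table of viscosity values can be regenerated with the following Python sketch, which computes the coefficients once and then reuses them for all twenty evaluations. Both helper functions are illustrative implementations written for this sketch, not routines from the text.

```python
def divided_differences(xs, fs):
    """Newton coefficients f[x_0], f[x_0,x_1], ..., f[x_0,...,x_n]."""
    coef = list(fs)
    n = len(xs)
    for k in range(1, n):
        for i in range(n - 1, k - 1, -1):
            coef[i] = (coef[i] - coef[i - 1]) / (xs[i] - xs[i - k])
    return coef


def newton_eval(coef, xs, x):
    """Nested evaluation of the Newton form at x."""
    value = coef[-1]
    for i in range(len(coef) - 2, -1, -1):
        value = coef[i] + (x - xs[i]) * value
    return value


w = [10, 20, 40, 60, 80, 100]
V = [1.498, 2.138, 2.840, 2.542, 1.877, 1.201]
coef = divided_differences(w, V)  # computed once, O(n^2) work

table = {wi: newton_eval(coef, w, wi) for wi in range(5, 101, 5)}
for wi, vi in table.items():
    print(f"{wi:3d}  {vi:.3f}")
```

This is the situation in which the Newton form pays off: the coefficients are found once, and each additional evaluation costs only O(n) operations.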


In addition to being more efficient from the point of view of evaluation and 
requiring fewer operations to determine its coefficients, the Newton form of the 
interpolating polynomial has another major advantage over the Lagrange form. 
Whereas the Lagrange form must be recomputed from scratch when new data 
become available, the Newton form allows for new data points to be easily incor- 
porated. For each new data point, an additional diagonal must be computed in 
the divided difference table, and one new term must be added to the interpolating 
polynomial. The next example demonstrates this process. 


EXAMPLE 5.10 An Arbitrary Set of Data Reconsidered 
Suppose that in addition to the four data values provided earlier:


x    -1    0    1    2
y     5    1    1   11


it is also known that y = 5 when x = -2 and y = 35 when x = 3. Estimate the
value of y when x = 1.5 using all six data points.

We could arrange the six points in ascending order and generate a whole new
divided difference table, but that would be doing more work than is really necessary.
Instead, we can simply add the two new points to the bottom of the table we had
already computed, and just complete those two new diagonals. The resulting,
augmented divided difference table is shown below. New values are displayed in
boldface.

$$\begin{array}{rrrrrrr}
-1 & 5 \\
 & & -4 \\
0 & 1 & & 2 \\
 & & 0 & & 1 \\
1 & 1 & & 5 & & \mathbf{-1/12} \\
 & & 10 & & \mathbf{13/12} & & \mathbf{0} \\
2 & 11 & & \mathbf{17/6} & & \mathbf{-1/12} \\
 & & \mathbf{3/2} & & \mathbf{5/6} \\
\mathbf{-2} & \mathbf{5} & & \mathbf{9/2} \\
 & & \mathbf{6} \\
\mathbf{3} & \mathbf{35}
\end{array}$$

The Newton form of the interpolating polynomial is then

$$P_{0,1,2,3,4,5}(x) = 5 - 4(x + 1) + 2(x + 1)x + (x + 1)x(x - 1) - \tfrac{1}{12}(x + 1)x(x - 1)(x - 2).$$

Evaluating this polynomial at x = 1.5 produces the value

$$\begin{aligned}
P_{0,1,2,3,4,5}(1.5) &= 5 - 4(1.5 + 1) + 2(1.5 + 1)(1.5) + (1.5 + 1)(1.5)(1.5 - 1) \\
&\quad - \tfrac{1}{12}(1.5 + 1)(1.5)(1.5 - 1)(1.5 - 2) \\
&= 5 - 10 + 7.5 + 1.875 + 0.078125 = 4.453125.
\end{aligned}$$
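
The diagonal-by-diagonal update can be sketched in Python as follows. Exact rational arithmetic (`fractions.Fraction`) is used so the entries match the fractions in the table; the helper name `add_point` and the decision to keep only the most recent diagonal are assumptions of this sketch.

```python
from fractions import Fraction


def add_point(xs, diag, x_new, f_new):
    """Compute the new bottom diagonal after appending (x_new, f_new).

    xs   -- abscissas already in the table
    diag -- previous bottom diagonal: f[x_m], f[x_{m-1},x_m], ..., f[x_0,...,x_m]
    The last entry of the returned list is the new Newton coefficient.
    """
    new_diag = [Fraction(f_new)]
    for k in range(1, len(xs) + 1):
        new_diag.append((new_diag[k - 1] - diag[k - 1]) / (x_new - xs[-k]))
    return new_diag


xs, diag, coeffs = [], [], []
for x, y in [(-1, 5), (0, 1), (1, 1), (2, 11), (-2, 5), (3, 35)]:
    diag = add_point(xs, diag, x, y)
    xs.append(x)
    coeffs.append(diag[-1])

# Nested evaluation of the six-point Newton form at x = 3/2.
x_eval = Fraction(3, 2)
value = coeffs[-1]
for i in range(len(coeffs) - 2, -1, -1):
    value = coeffs[i] + (x_eval - xs[i]) * value

print(coeffs, float(value))
```

The first four coefficients are unchanged from the four-point polynomial; only the two new diagonals had to be computed.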


Some texts define the divided difference $f[x_0, x_1, x_2, \ldots, x_k]$ to be the
leading coefficient in the unique polynomial that interpolates $f$ at the points
$x_0, x_1, x_2, \ldots, x_k$. We've already seen that our definition of divided
differences leads to this same interpretation. It can also be shown that if
$f[x_0, x_1, x_2, \ldots, x_k]$ is the leading coefficient in the unique polynomial that
interpolates $f$ at the points $x_0, x_1, x_2, \ldots, x_k$, then
$f[x_0, x_1, x_2, \ldots, x_k]$ must satisfy the recursive formula provided in our
definition (see Exercise 12). Hence, the two definitions are equivalent.

It is important to note that the value of $f[x_0, x_1, x_2, \ldots, x_k]$ is independent
of the order of the interpolating points $x_0, x_1, x_2, \ldots, x_k$. Thus, for example,
$f[x_0, x_1, x_2, x_3] = f[x_2, x_1, x_0, x_3] = f[x_3, x_0, x_1, x_2]$. The reason for this
independence is tied to the fact that $f[x_0, x_1, x_2, \ldots, x_k]$ is the coefficient of
$x^k$ in the polynomial of degree $k$ that interpolates $f$ at the points
$x_0, x_1, x_2, \ldots, x_k$. If we let $m_0, m_1, m_2, \ldots, m_k$ denote any permutation
of the integers 0 through $k$, inclusive, then
$f[x_{m_0}, x_{m_1}, x_{m_2}, \ldots, x_{m_k}]$ is the coefficient of $x^k$ in the polynomial
of degree $k$ which interpolates $f$ at the points
$x_{m_0}, x_{m_1}, x_{m_2}, \ldots, x_{m_k}$. Since the set of interpolating points is the
same in each case, the two interpolating polynomials must be the same, so the
coefficients of $x^k$ must be equal.
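
The order independence is easy to confirm numerically. The sketch below, using exact rational arithmetic and a hypothetical helper `top_divided_difference`, computes the highest-order divided difference for every ordering of the four data points from the arbitrary data set used earlier in this section.

```python
from fractions import Fraction
from itertools import permutations


def top_divided_difference(points):
    """Highest-order divided difference f[x_0, ..., x_k] for points in the given order."""
    xs = [x for x, _ in points]
    vals = [Fraction(y) for _, y in points]
    n = len(xs)
    for k in range(1, n):
        # After this pass, vals[i] holds f[x_i, ..., x_{i+k}].
        vals = [(vals[i + 1] - vals[i]) / (xs[i + k] - xs[i]) for i in range(n - k)]
    return vals[0]


data = [(-1, 5), (0, 1), (1, 1), (2, 11)]
results = {top_divided_difference(list(p)) for p in permutations(data)}
print(results)  # a single value if the ordering does not matter
```

All 24 orderings yield the same third divided difference, as the argument above predicts.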
Interpolation Error Revisited


To finish out this section, we will show that the error in polynomial interpolation can
be expressed in terms of divided differences. Suppose that $P$ is the polynomial
that interpolates $f$ at the points $x_0, x_1, x_2, \ldots, x_n$. Further suppose that
$P^*$ is the polynomial which interpolates $f$ at the points
$x_0, x_1, x_2, \ldots, x_n, t$. It follows that

$$P^*(x) = P(x) + f[x_0, x_1, x_2, \ldots, x_n, t] \prod_{i=0}^{n} (x - x_i).$$

Evaluating this expression at $x = t$ and using the fact that $P^*(t) = f(t)$ yields

$$f(t) = P(t) + f[x_0, x_1, x_2, \ldots, x_n, t] \prod_{i=0}^{n} (t - x_i).$$

Hence, upon rearranging terms and replacing the variable $t$ by $x$, we arrive at the
error formula

$$f(x) - P(x) = f[x_0, x_1, x_2, \ldots, x_n, x] \prod_{i=0}^{n} (x - x_i). \qquad (2)$$

Equation (2) may look different from the error term established in Section 5.1, but
we will now show that they are the same.

Recall that the first divided difference of a function $f$ with respect to the
points $x_0$ and $x_1$ is given by

$$f[x_0, x_1] = \frac{f(x_1) - f(x_0)}{x_1 - x_0}.$$

If $f$ is continuous on the closed interval $[x_0, x_1]$ and differentiable on the open
interval $(x_0, x_1)$, then the Mean Value Theorem guarantees the existence of
$\xi \in (x_0, x_1)$ such that

$$\frac{f(x_1) - f(x_0)}{x_1 - x_0} = f'(\xi).$$

Combining the last two equations gives

$$f[x_0, x_1] = f'(\xi)$$

for some $\xi \in (x_0, x_1)$, provided that $f$ is sufficiently differentiable.

A similar relationship holds between higher divided differences and higher-order
derivatives. The specific nature of this relationship is established in the next
theorem.


Theorem. Let $x_0, x_1, x_2, \ldots, x_n$ be $n+1$ distinct points from the closed
interval $[a, b]$. If $f$ is continuous on $[a, b]$ and has $n$ continuous derivatives on
the open interval $(a, b)$, then there exists a $\xi \in (a, b)$ such that

$$f[x_0, x_1, x_2, \ldots, x_n] = \frac{f^{(n)}(\xi)}{n!}.$$

Proof. Let $x_0, x_1, x_2, \ldots, x_n$ be $n+1$ distinct points from the closed
interval $[a, b]$, and suppose that $f$ is continuous on $[a, b]$ and has $n$ continuous
derivatives on the open interval $(a, b)$. Consider the auxiliary function

$$g(x) = f(x) - P_{0,1,2,\ldots,n}(x).$$

Since $P_{0,1,2,\ldots,n}$ is continuous and infinitely differentiable everywhere, it
follows that $g$ is continuous on $[a, b]$ and has $n$ continuous derivatives on the
open interval $(a, b)$. Furthermore, since $P_{0,1,2,\ldots,n}$ interpolates $f$ at each of
the points $x_0, x_1, x_2, \ldots, x_n$,

$$g(x_i) = 0 \quad \text{for each } i = 0, 1, 2, \ldots, n.$$

Applying the Generalized Rolle's Theorem, there exists $\xi \in (a, b)$ such that
$g^{(n)}(\xi) = 0$. Hence

$$0 = g^{(n)}(\xi) = f^{(n)}(\xi) - P^{(n)}_{0,1,2,\ldots,n}(\xi).$$

To complete the proof, recognize that $P_{0,1,2,\ldots,n}$ is a polynomial of degree at
most $n$ in which the coefficient of $x^n$ is $f[x_0, x_1, x_2, \ldots, x_n]$. Therefore,

$$P^{(n)}_{0,1,2,\ldots,n}(\xi) = n!\, f[x_0, x_1, x_2, \ldots, x_n].$$

Substituting this result into the previous equation and solving for the divided
difference yields

$$f[x_0, x_1, x_2, \ldots, x_n] = \frac{f^{(n)}(\xi)}{n!}. \qquad \square$$
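
A quick numerical illustration of the theorem, under the assumption $f(x) = e^x$ with the three points $-1$, $0$, $1$ (chosen here only for convenience): since every derivative of $e^x$ is $e^x$, the point $\xi$ can be recovered explicitly from the second divided difference and checked to lie inside the interval.

```python
import math

xs = [-1.0, 0.0, 1.0]
f = math.exp

# First and second divided differences of exp at the three points.
first = [(f(xs[i + 1]) - f(xs[i])) / (xs[i + 1] - xs[i]) for i in range(2)]
second = (first[1] - first[0]) / (xs[2] - xs[0])

# Theorem with n = 2: second = f''(xi) / 2!.  For exp, f'' = exp,
# so xi = ln(2! * second).
xi = math.log(math.factorial(2) * second)
print(second, xi)
```

The recovered value of $\xi$ falls strictly between $-1$ and $1$, as the theorem guarantees.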


Using this theorem, it follows that for each $x \in (a, b)$, there exists
$\xi \in (a, b)$ such that

$$f[x_0, x_1, x_2, \ldots, x_n, x] = \frac{f^{(n+1)}(\xi)}{(n+1)!}. \qquad (3)$$

Substituting (3) into (2), we obtain the error formula from Section 5.1.


EXERCISES 


1. Assess the accuracy of the values in the relative viscosity table developed earlier
in this section by plotting the values from the table and the six given data values
on the same set of axes.

2. A more extensive table lists the viscosity of ethanol as 2.209 when the anhydrous
solute weight is 70%. Add this value to the bottom of the divided difference
table provided in the example in the text and compute the new values at the
bottom of each column. What is the interpolating polynomial using seven data
points rather than the original six?

3. Construct the divided difference table for the following data set, and then write
out the Newton form of the interpolating polynomial.

x     2    4    8
y    -1    4    8

4. Construct the divided difference table for the following data set, and then write
out the Newton form of the interpolating polynomial.

x     0    1    2
y     2   -1    4

5. Construct the divided difference table for the following data set, and then write
out the Newton form of the interpolating polynomial.

x    -1    0    1    2

6. Construct the divided difference table for the following data set, and then write
out the Newton form of the interpolating polynomial.

x    -7   -5   -4   -1
y     1    5    2   10

7. Write out the Newton form of the interpolating polynomial for $f(x) = \ln x$ that
passes through the points $(1, \ln 1)$, $(2, \ln 2)$, and $(3, \ln 3)$.

8. Write out the Newton form of the interpolating polynomial for $f(x) = \sin x$ that
passes through the points $(0, \sin 0)$, $(\pi/4, \sin \pi/4)$, and $(\pi/2, \sin \pi/2)$.

9. Write out the Newton form of the interpolating polynomial for $f(x) = e^x$ that
passes through the points $(-1, e^{-1})$, $(0, e^0)$, and $(1, e^1)$.

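10. Determine the missing values in the divided difference table provided below.

$$\begin{array}{llll}
x_0 = 0 & f[x_0] = -1 \\
 & & f[x_0, x_1] = 5 \\
x_1 = 1 & f[x_1] = \,? & & f[x_0, x_1, x_2] = -3 \\
 & & f[x_1, x_2] = \,? \\
x_2 = 2 & f[x_2] = \,?
\end{array}$$

11. Determine the missing values in the divided difference table provided below.

$$\begin{array}{lllll}
x_0 = 0 & f[x_0] = 1 \\
 & & f[x_0, x_1] = 2 \\
x_1 = 1 & f[x_1] = 3 & & f[x_0, x_1, x_2] = \,? \\
 & & f[x_1, x_2] = \,? & & f[x_0, x_1, x_2, x_3] = \,? \\
x_2 = 2 & f[x_2] = 3 & & f[x_1, x_2, x_3] = 0 \\
 & & f[x_2, x_3] = 0 \\
x_3 = 3 & f[x_3] = \,?
\end{array}$$

12. Let $f[x_0, x_1, x_2, \ldots, x_k]$ be defined as the leading coefficient in the unique
polynomial which interpolates $f$ at the points $x_0, x_1, x_2, \ldots, x_k$. Show that

$$f[x_0, x_1, x_2, \ldots, x_k] = \frac{f[x_1, x_2, \ldots, x_k] - f[x_0, x_1, \ldots, x_{k-1}]}{x_k - x_0}.$$
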

13. The values listed in the table provide the surface tension of mercury as a function
of temperature.


Temperature (°C)           10      25      50      75      100
Surface Tension (dyn/cm)   488.55  485.48  480.36  475.23  470.11

Use these values to determine the Newton form of the interpolating polynomial,
and then use the polynomial to produce a table of surface tension values for
temperatures ranging from 5°C through 100°C in increments of 5°C. Assess the
accuracy of the table by plotting the values from the table and the five given
data values on the same set of axes.


14. The thermal conductivity of air as a function of temperature is given in the table
below. Estimate the thermal conductivity of air when T = 240 K and when
T = 485 K, using the Newton form of the interpolating polynomial.


Temperature (K)                  100   200   300   400   500   600
Thermal Conductivity (mW/m·K)    9.4   18.4  26.2  33.3  39.7  45.7


15. Experimentally determined values for the partial pressure of water vapor, $p_A$,
as a function of distance, y, from the surface of a pan of water are given below.
Estimate the partial pressure at distances of 0.5 mm, 2.1 mm, and 3.7 mm from
the surface of the water.

y (mm)        0      1      2      3      4      5
$p_A$ (atm)   0.100  0.065  0.042  0.029  0.022  0.020


16. Ammonia vapor is compressed inside a cylinder by an external force acting on
the piston. The ammonia is initially at 30°C, 500 kPa and the final pressure is
1400 kPa. The following data have been experimentally determined during the
process. Use the Newton form of the interpolating polynomial to determine a
table of volume as a function of pressure, with pressure ranging from 500 kPa
through 1400 kPa in increments of 50 kPa.


Pressure (kPa)   500   653   802   945   1100  1248  1400
Volume (l)       1.25  1.08  0.96  0.84  0.72  0.60  0.50


Table A.5 in Frank White, Fluid Mechanics, lists the following values for the surface
tension, Υ, vapor pressure, $p_v$, and sound speed, a, for water as a function of
temperature. Use this data for Exercises 17-19.


T (°C)   Υ (N/m)   $p_v$ (kPa)   a (m/s)
  0      0.0756     0.611        1402
 10      0.0742     1.227        1447
 20      0.0728     2.337        1482
 30      0.0712     4.242        1509
 40      0.0696     7.375        1529
 50      0.0679    12.34         1542
 60      0.0662    19.92         1551
 70      0.0644    31.16         1553
 80      0.0626    47.35         1554
 90      0.0608    70.11         1580
100      0.0589   101.3          1543

17. Use the Newton form of the interpolating polynomial to determine the surface
tension of water when T = 34°C, 68°C, 86°C, and 91°C.


18. Use the Newton form of the interpolating polynomial to determine the vapor
pressure of water when T = 34°C, 68°C, 86°C, and 91°C.


19. Use the Newton form of the interpolating polynomial to determine the sound
speed of water when T = 34°C, 68°C, 86°C, and 91°C.


5.4 OPTIMAL POINTS FOR INTERPOLATION 


In many practical situations, we are not free to choose the interpolating points. 
For example, the function we wish to interpolate may be known only in terms of 
experimental data or from some standard engineering table. When we do have the 
freedom to choose the interpolating points, a natural question arises. Can we select 
the interpolation points so as to minimize the interpolation error? 

It turns out that we can readily minimize a portion of the error, but the details
depend upon the norm being used to measure the error. Here, we will consider only
the $l_\infty$ and $l_2$ norms. For a function $f$, continuous over the interval $[a,b]$, these
norms are defined by

$$\|f\|_\infty = \max_{a \le x \le b} |f(x)|$$

and

$$\|f\|_2 = \left( \int_a^b [f(x)]^2 \, dx \right)^{1/2}.$$

Remember that when we interpolate the function $f$ over the interval $[a,b]$ at
the $n+1$ points $x_0, x_1, x_2, \ldots, x_n$, the interpolation error is given by

$$f(x) - p_n(x) = \frac{f^{(n+1)}(\xi)}{(n+1)!} (x - x_0)(x - x_1)(x - x_2) \cdots (x - x_n).$$

Here, $p_n$ is the interpolating polynomial of degree at most $n$ based on the $n+1$
interpolating points, and
$\min(x_0, x_1, x_2, \ldots, x_n, x) < \xi < \max(x_0, x_1, x_2, \ldots, x_n, x)$.
Since $f$ is given and $n$ has been fixed, the only portion of the error formula that we
can control is the polynomial

$$w(x) = (x - x_0)(x - x_1)(x - x_2) \cdots (x - x_n).$$

In this section, we will determine the interpolation points which make the $l_\infty$ and
the $l_2$ norm of this polynomial as small as possible.


Chebyshev Polynomials 


When working with the $l_\infty$ norm, the optimal points for interpolation are related
to a special family of functions known as the Chebyshev polynomials.


Definition. For each nonnegative integer $n$, the CHEBYSHEV POLYNOMIAL
$T_n$ is defined for $x \in [-1,1]$ by

$$T_n(x) = \cos(n \cos^{-1} x).$$




Granted, these functions don't look much like polynomials, but we will shortly
see that each $T_n$ is, in fact, an $n$th-degree polynomial.

The first order of business is to derive a recurrence relation for the Chebyshev
polynomials. This is a formula that indicates how to construct one Chebyshev
polynomial from others in the family. For notational convenience, let $\theta = \cos^{-1} x$.
Then $T_n(\cos\theta) = \cos n\theta$. Using the standard trigonometric identities for the cosine
of a sum and of a difference of two angles,

$$T_{n+1}(\cos\theta) = \cos[(n+1)\theta] = \cos n\theta \cos\theta - \sin n\theta \sin\theta$$

and

$$T_{n-1}(\cos\theta) = \cos[(n-1)\theta] = \cos n\theta \cos\theta + \sin n\theta \sin\theta.$$

Adding these two expressions gives

$$T_{n+1}(\cos\theta) + T_{n-1}(\cos\theta) = 2\cos\theta \cos n\theta = 2\cos\theta \, T_n(\cos\theta).$$

Finally, solving for $T_{n+1}$ and returning to the variable $x$ yields

$$T_{n+1}(x) = 2x T_n(x) - T_{n-1}(x),$$

for any $n \ge 1$.

Before the recurrence relation can be used, simple expressions for both $T_0(x)$
and $T_1(x)$ are needed. Substituting $n = 0$ into the definition gives

$$T_0(x) = \cos(0 \cdot \cos^{-1} x) = \cos 0 = 1,$$

while, for $n = 1$,

$$T_1(x) = \cos(1 \cdot \cos^{-1} x) = x.$$

Now, applying the recurrence relation, the next three Chebyshev polynomials are
found to be

$$T_2(x) = 2x T_1(x) - T_0(x) = 2x^2 - 1,$$

$$T_3(x) = 2x T_2(x) - T_1(x) = 2x(2x^2 - 1) - x = 4x^3 - 3x,$$

and

$$T_4(x) = 2x T_3(x) - T_2(x) = 2x(4x^3 - 3x) - (2x^2 - 1) = 8x^4 - 8x^2 + 1.$$

These polynomials suggest that

1. for each $n$, $T_n(x)$ is an $n$th-degree polynomial;
2. for each $n \ge 1$, the leading coefficient in $T_n$ is $2^{n-1}$; and
3. $T_n$ is even (odd) when $n$ is even (odd).

Figure 5.10 Graphs of the Chebyshev polynomials of degree one
through four.


A simple inductive argument can be used to show that each of these statements is
true in general.
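
The recurrence relation lends itself to a short computation. The following is a minimal sketch, not part of the original development, that builds the coefficient lists of $T_0, \ldots, T_n$ from $T_{n+1}(x) = 2xT_n(x) - T_{n-1}(x)$; the function name `chebyshev_coefficients` and the lowest-degree-first storage convention are our own illustrative choices.

```python
def chebyshev_coefficients(n_max):
    """Return coefficient lists (lowest degree first) for T_0, ..., T_{n_max}."""
    polys = [[1], [0, 1]]  # T_0(x) = 1 and T_1(x) = x
    for n in range(1, n_max):
        prev, curr = polys[n - 1], polys[n]
        # 2x * T_n(x): shift every coefficient up one degree and double it.
        nxt = [0] + [2 * c for c in curr]
        # Subtract T_{n-1}(x) term by term.
        for k, c in enumerate(prev):
            nxt[k] -= c
        polys.append(nxt)
    return polys[: n_max + 1]


for n, coeffs in enumerate(chebyshev_coefficients(4)):
    print(n, coeffs)
```

Reading the printed lists from right to left reproduces $T_2$, $T_3$, and $T_4$ above, and the last entry of each list is the leading coefficient.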

Graphs of the Chebyshev polynomials of degree one through four are shown 
in Figure 5.10. Two important observations can be made from these graphs. First, 
each Chebyshev polynomial appears to have all of its roots on the interval [—1, 1], 
with each root being of multiplicity one. Second, on the interval [—1,1], each 
polynomial appears to oscillate between a maximum value of +1 and a minimum 
value of —1, with the number of extreme values being one more than the number 
of roots. The following theorem makes these observations more precise. 


Theorem. The Chebyshev polynomial $T_n(x)$, of degree $n \ge 1$, has $n$ simple
roots on the interval $[-1, 1]$ at

$$x_j = \cos\left(\frac{2j-1}{2n}\pi\right)$$

for $j = 1, 2, 3, \ldots, n$. Moreover, $T_n(x)$ has absolute extreme values on $[-1, 1]$
at

$$z_j = \cos\left(\frac{j\pi}{n}\right)$$

for $j = 0, 1, 2, \ldots, n$, with $T_n(z_j) = (-1)^j$.


Proof. Let $n \ge 1$, and let

$$x_j = \cos\left(\frac{2j-1}{2n}\pi\right)$$

for $j = 1, 2, 3, \ldots, n$. Then

$$T_n(x_j) = \cos\left(n \cos^{-1}\left[\cos\left(\frac{2j-1}{2n}\pi\right)\right]\right) = \cos\left(\frac{2j-1}{2}\pi\right) = 0,$$

and each $x_j$ is a root of $T_n$. Since $T_n$ is a polynomial of degree $n$, the $x_j$ must
be the only zeros of $T_n$, and since they are distinct, they are all simple roots.

Recall that for a function defined over a closed interval, extreme values
can occur only at critical points or at the endpoints of the interval. The
critical points of $T_n$ are the roots of

$$T_n'(x) = \frac{n \sin(n \cos^{-1} x)}{\sqrt{1 - x^2}}.$$

Now, let

$$z_j = \cos\left(\frac{j\pi}{n}\right)$$

for $j = 0, 1, 2, \ldots, n$. For $j = 1, 2, 3, \ldots, n-1$,

$$T_n'(z_j) = \frac{n \sin(j\pi)}{\sin(j\pi/n)} = 0.$$

Since $T_n$ is a polynomial of degree $n$, $T_n'$ will be a polynomial of degree $n-1$;
therefore, the points $z_1, z_2, z_3, \ldots, z_{n-1}$ account for all of the roots of $T_n'$ and,
hence, all of the critical points of $T_n$. Note that $z_0 = 1$ and $z_n = -1$ are the
endpoints of the interval under consideration, so the $z_j$ are the only possible
candidates for the extrema of $T_n$. At each of the $z_j$, a direct calculation shows
that

$$T_n(z_j) = \cos\left(n \cos^{-1}\left[\cos\left(\frac{j\pi}{n}\right)\right]\right) = \cos(j\pi) = (-1)^j.$$

For $j$ even, $T_n$ therefore achieves its absolute maximum value on $[-1, 1]$ of $+1$,
while for $j$ odd, $T_n$ achieves its absolute minimum value on $[-1, 1]$ of $-1$. □
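
The two formulas in the theorem are easy to check numerically. The sketch below, offered as an illustration rather than a substitute for the proof, evaluates $T_n$ directly from the definition $T_n(x) = \cos(n\cos^{-1}x)$ at the claimed roots and extreme points; the helper names `cheb`, `cheb_roots`, and `cheb_extrema` are hypothetical.

```python
import math


def cheb(n, x):
    """Evaluate T_n(x) on [-1, 1] from the definition cos(n * arccos(x))."""
    return math.cos(n * math.acos(x))


def cheb_roots(n):
    """Claimed roots: cos((2j - 1) * pi / (2n)) for j = 1, ..., n."""
    return [math.cos((2 * j - 1) * math.pi / (2 * n)) for j in range(1, n + 1)]


def cheb_extrema(n):
    """Claimed extreme points: cos(j * pi / n) for j = 0, ..., n."""
    return [math.cos(j * math.pi / n) for j in range(n + 1)]


n = 5
print([round(cheb(n, x), 12) for x in cheb_roots(n)])    # values near 0
print([round(cheb(n, z), 12) for z in cheb_extrema(n)])  # alternating +1, -1
```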


For our current objective, that of determining the interpolating points which
will minimize the maximum norm of the interpolation error, the most important
property of the Chebyshev polynomials is the minimax property. One final
preliminary item is needed before this property can be discussed.


Definition. A polynomial is MONIC if its leading coefficient is $+1$. Let $\tilde{\Pi}_n$
denote the set of all monic polynomials of degree $n$.

With the exception of $T_0$ and $T_1$, the Chebyshev polynomials are clearly not
monic, as the leading coefficient is of the form $2^{n-1}$. However, the polynomials

$$\tilde{T}_n(x) = \begin{cases} T_0(x), & n = 0 \\ 2^{1-n} T_n(x), & n \ge 1 \end{cases}$$

are monic. These are known as the monic Chebyshev polynomials. Since $\tilde{T}_n$ is just
a multiple of $T_n$, the roots and extrema of $\tilde{T}_n$ are located at exactly the same points
as the roots and extrema of $T_n$. The extreme values of $\tilde{T}_n$, though, are reduced
to $\pm 2^{1-n}$.


Theorem. The monic Chebyshev polynomial $\tilde{T}_n$ ($n \ge 1$) satisfies

$$2^{1-n} = \max_{-1 \le x \le 1} |\tilde{T}_n(x)| \le \max_{-1 \le x \le 1} |p_n(x)|$$

for any $p_n \in \tilde{\Pi}_n$, with equality if and only if $p_n = \tilde{T}_n$.


Proof. Suppose $p_n \in \tilde{\Pi}_n$ with

$$\max_{-1 \le x \le 1} |p_n(x)| \le \max_{-1 \le x \le 1} |\tilde{T}_n(x)|.$$

Let $q(x) = \tilde{T}_n(x) - p_n(x)$. Since $\tilde{T}_n$ and $p_n$ are both monic polynomials of
degree $n$, it follows that $q$ is a polynomial of degree at most $n-1$. At the
extrema of $\tilde{T}_n$,

$$q(z_j) = \tilde{T}_n(z_j) - p_n(z_j) = 2^{1-n}(-1)^j - p_n(z_j).$$

By supposition, $|p_n(z_j)| \le 2^{1-n}$, so

$$q(z_j) \ge 0 \text{ whenever } j \text{ is even}, \qquad q(z_j) \le 0 \text{ whenever } j \text{ is odd}.$$

Therefore, $q$ is guaranteed to have at least one root between $z_j$ and $z_{j+1}$ for
each $j = 0, 1, 2, \ldots, n-1$. But this implies that $q$ is a polynomial of degree
at most $n-1$ with at least $n$ roots, which is impossible unless $q \equiv 0$. Thus
$p_n = \tilde{T}_n$. □


Chebyshev polynomials have many other interesting and important properties,
a few of which will be explored in the exercises. For a more complete discussion of
Chebyshev polynomials, encompassing their role in both applied mathematics and
numerical analysis, consult Rivlin [1] or Fox and Parker [2].


Optimal Interpolating Points Relative to Maximum Norm


Suppose we wish to interpolate the function $f$ over the interval $[-1, 1]$ at the $n+1$
points $x_0, x_1, x_2, \ldots, x_n$, choosing these points so as to minimize the $l_\infty$ norm of
the polynomial

$$w(x) = (x - x_0)(x - x_1)(x - x_2) \cdots (x - x_n).$$

Note that $w(x)$ happens to be a monic polynomial of degree $n+1$ whose $l_\infty$ norm
we desire to make as small as possible. This can only be accomplished by taking
$w(x) = \tilde{T}_{n+1}(x)$, which can only happen if the interpolating points are chosen to be
the roots of $T_{n+1}(x)$. Thus to minimize the $l_\infty$ norm of $w$, the interpolating points
must be chosen as

$$x_i = \cos\left(\frac{2i+1}{2n+2}\pi\right), \qquad i = 0, 1, 2, \ldots, n.$$

With this choice of interpolating points, it follows that $\|w\|_\infty = 2^{-n}$; hence, over
the interval $[-1, 1]$,

$$\|f - p_n\|_\infty \le \frac{\|f^{(n+1)}\|_\infty}{2^n (n+1)!}.$$
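
As a quick numerical illustration of the minimax property, the sketch below samples $|w(x)|$ on a fine grid for the Chebyshev points and for uniformly spaced points with $n = 4$. The grid-based maximum only approximates the true $l_\infty$ norm, and the helper names (`chebyshev_points`, `node_polynomial`, `max_abs_on_grid`) are our own.

```python
import math


def chebyshev_points(n):
    """Roots of T_{n+1}: cos((2i + 1) * pi / (2n + 2)) for i = 0, ..., n."""
    return [math.cos((2 * i + 1) * math.pi / (2 * n + 2)) for i in range(n + 1)]


def node_polynomial(points, x):
    """Evaluate w(x) = (x - x_0)(x - x_1)...(x - x_n)."""
    prod = 1.0
    for p in points:
        prod *= x - p
    return prod


def max_abs_on_grid(func, a, b, samples=2001):
    """Approximate the maximum of |func| on [a, b] by uniform sampling."""
    step = (b - a) / (samples - 1)
    return max(abs(func(a + k * step)) for k in range(samples))


n = 4
cheb_pts = chebyshev_points(n)
unif_pts = [-1.0 + 2.0 * i / n for i in range(n + 1)]
print(max_abs_on_grid(lambda x: node_polynomial(cheb_pts, x), -1.0, 1.0))
print(max_abs_on_grid(lambda x: node_polynomial(unif_pts, x), -1.0, 1.0))
print(2.0 ** (-n))
```

The first printed value should agree with $2^{-n}$ to rounding error, while the uniformly spaced points give a noticeably larger value.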


EXAMPLE 5.11 The Optimal Interpolating Points - Maximum Norm


Let's interpolate the function $f(x) = \sin \pi x$ over the interval $[-1, 1]$ with a
polynomial of degree at most 4. The interpolating points which will minimize the $l_\infty$
norm of $w$ are therefore the roots of $T_5$:

$$x_0 = \cos\frac{\pi}{10}, \quad x_1 = \cos\frac{3\pi}{10}, \quad x_2 = \cos\frac{5\pi}{10} = 0, \quad x_3 = \cos\frac{7\pi}{10}, \quad x_4 = \cos\frac{9\pi}{10}.$$

For comparison, we will also interpolate $f$ at the uniformly spaced points

$$x_0 = -1, \quad x_1 = -0.5, \quad x_2 = 0, \quad x_3 = 0.5, \quad x_4 = 1,$$

and a third set of interpolating points, an explanation for which will be provided
later in the section:

$$x_0 = -0.90618, \quad x_1 = -0.53847, \quad x_2 = 0, \quad x_3 = 0.53847, \quad x_4 = 0.90618.$$

Figure 5.11 displays the absolute value of the interpolation error for the
interpolating polynomial of degree at most four based on each of these three sets of
points. Although the uniformly spaced points and the third set of points produce
smaller errors in the middle section of the domain, the Chebyshev points produce
substantially smaller errors at the edges of the domain and a substantially smaller
maximum error. In particular, the $l_\infty$ norms are roughly 0.11556, 0.18076, and
0.19216 for the Chebyshev points, the uniformly spaced points, and the third set
of points, respectively.

Figure 5.11 Comparison of interpolation error in the interpolating
polynomial of degree at most four based on Chebyshev interpolating
points, uniformly spaced interpolating points and a third set of
interpolating points.
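
The comparison in Example 5.11 can be reproduced approximately with a few lines of code. The sketch below assumes $f(x) = \sin \pi x$ (the function named in Example 5.13 for the same figure), evaluates each interpolating polynomial in Lagrange form, and estimates the maximum error by sampling on a uniform grid; the grid size and all function names are illustrative choices, so the printed values should only be expected to match the quoted norms to a few digits.

```python
import math


def lagrange_eval(xs, ys, x):
    """Evaluate the interpolating polynomial through (xs, ys) at x (Lagrange form)."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total


def max_interp_error(f, xs, a=-1.0, b=1.0, samples=2001):
    """Approximate the maximum of |f - p| on [a, b] by uniform sampling."""
    ys = [f(x) for x in xs]
    step = (b - a) / (samples - 1)
    return max(abs(f(a + k * step) - lagrange_eval(xs, ys, a + k * step))
               for k in range(samples))


def f(x):
    return math.sin(math.pi * x)


chebyshev = [math.cos((2 * i + 1) * math.pi / 10) for i in range(5)]
uniform = [-1.0, -0.5, 0.0, 0.5, 1.0]
third = [-0.90618, -0.53847, 0.0, 0.53847, 0.90618]

for name, pts in (("Chebyshev", chebyshev), ("uniform", uniform), ("third set", third)):
    print(name, max_interp_error(f, pts))
```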


What if the interval under consideration is not $[-1, 1]$? Note that the linear
function

$$x = \frac{b-a}{2} t + \frac{a+b}{2}$$

translates and scales the interval $-1 \le t \le 1$ onto the interval $a \le x \le b$. To
minimize the $l_\infty$ norm of $w$ over the general interval $[a, b]$, we should therefore
choose

$$x_i = \frac{b-a}{2} \cos\left(\frac{2i+1}{2n+2}\pi\right) + \frac{a+b}{2}, \qquad i = 0, 1, 2, \ldots, n$$

as the interpolating points; that is, the scaled and translated roots of the monic
Chebyshev polynomial $\tilde{T}_{n+1}$.
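
A minimal sketch of this scaling and translation follows; the function name `chebyshev_points_on` is hypothetical. Calling it with $a = -1$, $b = 3$, and $n = 4$ produces points of the form $1 + 2\cos\left(\frac{(2i+1)\pi}{10}\right)$.

```python
import math


def chebyshev_points_on(a, b, n):
    """Roots of T_{n+1}, scaled by (b - a)/2 and translated by (a + b)/2."""
    half_width = (b - a) / 2.0
    midpoint = (a + b) / 2.0
    return [half_width * math.cos((2 * i + 1) * math.pi / (2 * n + 2)) + midpoint
            for i in range(n + 1)]


print(chebyshev_points_on(-1.0, 3.0, 4))
```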

EXAMPLE 5.12 An Interval Other Than [-1, 1]

Let's interpolate the function $f(x) = xe^{-x}$ over $[-1, 3]$ using a polynomial of degree
at most four. The properly scaled and translated roots of $T_5$ are

$$x_0 = 1 + 2\cos\frac{\pi}{10}, \quad x_1 = 1 + 2\cos\frac{3\pi}{10}, \quad x_2 = 1 + 2\cos\frac{5\pi}{10} = 1,$$

$$x_3 = 1 + 2\cos\frac{7\pi}{10}, \quad x_4 = 1 + 2\cos\frac{9\pi}{10}.$$

The interpolating polynomial generated by these points is

$$p_4(x) = -0.06011x^4 + 0.43376x^3 - 1.11011x^2 + 1.08627x + 0.01807,$$

and $\|f - p_4\|_\infty = 0.04191$. It is left as an exercise to verify that interpolation at
uniformly spaced points and at the third set of points introduced earlier, translated
and scaled to $[-1, 3]$ of course, both produce larger maximum errors.


Legendre Polynomials 


Another special family of functions that arises in applied mathematics is the Leg- 
endre polynomials. 


Definition. For each nonnegative integer $n$, the LEGENDRE POLYNOMIAL
$P_n(x)$ satisfies the recurrence relation

$$P_n(x) = \frac{2n-1}{n} x P_{n-1}(x) - \frac{n-1}{n} P_{n-2}(x),$$

with $P_0(x) = 1$ and $P_1(x) = x$.


Since $P_0(x) = 1$ and $P_1(x) = x$, the recurrence relation implies, through
mathematical induction on $n$, that each $P_n$ is a polynomial of degree $n$. In particular,

$$P_2(x) = \frac{3}{2} x P_1(x) - \frac{1}{2} P_0(x) = \frac{3}{2}x^2 - \frac{1}{2};$$

$$P_3(x) = \frac{5}{3} x P_2(x) - \frac{2}{3} P_1(x) = \frac{5}{2}x^3 - \frac{3}{2}x; \text{ and}$$

$$P_4(x) = \frac{7}{4} x P_3(x) - \frac{3}{4} P_2(x) = \frac{35}{8}x^4 - \frac{15}{4}x^2 + \frac{3}{8}.$$
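
The recurrence can be carried out in exact rational arithmetic. The sketch below is an illustration only; `legendre_coefficients` is a hypothetical helper that stores coefficients lowest degree first and uses Python's `fractions.Fraction` so that the results can be compared directly with the expressions above.

```python
from fractions import Fraction


def legendre_coefficients(n_max):
    """Return coefficient lists (lowest degree first) for P_0, ..., P_{n_max}."""
    polys = [[Fraction(1)], [Fraction(0), Fraction(1)]]  # P_0 = 1, P_1 = x
    for n in range(2, n_max + 1):
        a = Fraction(2 * n - 1, n)   # multiplies x * P_{n-1}
        b = Fraction(n - 1, n)       # multiplies P_{n-2}
        nxt = [Fraction(0)] + [a * c for c in polys[n - 1]]
        for k, c in enumerate(polys[n - 2]):
            nxt[k] -= b * c
        polys.append(nxt)
    return polys[: n_max + 1]


for n, coeffs in enumerate(legendre_coefficients(4)):
    print(n, [str(c) for c in coeffs])
```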

We now turn to a discussion of some of the relevant properties of the Legendre 
polynomials. 

First, the Legendre polynomials satisfy the following integral relationship
(Hildebrand [3]):

$$\int_{-1}^{1} P_j(x) P_k(x) \, dx = \begin{cases} 0, & j \ne k \\ \dfrac{2}{2j+1}, & j = k. \end{cases} \qquad (1)$$

In general, two functions $f$ and $g$ for which

$$\int_a^b w(x) f(x) g(x) \, dx = 0$$

are said to be orthogonal on $[a, b]$ with respect to $w(x)$. The function $w$, called a
weight function, must be nonnegative and integrable on $[a, b]$ and can evaluate to
zero only at isolated points along $[a, b]$. When any pair of different functions chosen
from a set of functions is orthogonal, those functions are said to form an orthogonal
set. In light of these definitions, equation (1) tells us that the Legendre polynomials
form an orthogonal set on $[-1, 1]$ with respect to the weight function $w(x) = 1$. In
Exercise 3, we will see that the Chebyshev polynomials also form an orthogonal set
on the interval $[-1, 1]$, but with respect to the weight function $w(x) = (1 - x^2)^{-1/2}$.

Next, let $k$ be a nonnegative integer, and consider the set $\{P_0, P_1, P_2, \ldots, P_k\}$.
Suppose there exist constants $c_0, c_1, c_2, \ldots, c_k$ such that

$$c_0 P_0(x) + c_1 P_1(x) + c_2 P_2(x) + \cdots + c_k P_k(x) = 0$$

for all $x$. Multiply this equation by $P_j(x)$ and then integrate from $x = -1$ to
$x = +1$. This yields

$$\int_{-1}^{1} \sum_{i=0}^{k} c_i P_i(x) P_j(x) \, dx = \sum_{i=0}^{k} c_i \int_{-1}^{1} P_i(x) P_j(x) \, dx = 0.$$

Orthogonality guarantees that each integral, except the one containing $[P_j(x)]^2$,
vanishes. This leaves

$$c_j \int_{-1}^{1} [P_j(x)]^2 \, dx = c_j \frac{2}{2j+1} = 0,$$

or $c_j = 0$, for each $j = 0, 1, 2, \ldots, k$. Hence, the set $\{P_0, P_1, P_2, \ldots, P_k\}$ is linearly
independent. Combining linear independence with the fact that each $P_j(x)$ is a
polynomial of degree $j$, it follows that $\{P_0, P_1, P_2, \ldots, P_k\}$ spans the space of all
polynomials of degree at most $k$.

Finally, let $k$ and $n$ be nonnegative integers with $k < n$. Further, let $q$ be an
arbitrary polynomial of degree $k$. Since $\{P_0, P_1, P_2, \ldots, P_k\}$ spans the space of all
polynomials of degree at most $k$, there exist constants $c_0, c_1, c_2, \ldots, c_k$ such that

$$q(x) = c_0 P_0(x) + c_1 P_1(x) + c_2 P_2(x) + \cdots + c_k P_k(x) = \sum_{j=0}^{k} c_j P_j(x).$$

Then

$$\int_{-1}^{1} q(x) P_n(x) \, dx = \int_{-1}^{1} \left( \sum_{j=0}^{k} c_j P_j(x) \right) P_n(x) \, dx = \sum_{j=0}^{k} c_j \int_{-1}^{1} P_j(x) P_n(x) \, dx = 0$$

by the orthogonality of the Legendre polynomials. Hence, each Legendre polynomial
$P_n$ is orthogonal to all polynomials of degree less than $n$ over the interval
$[-1, 1]$ with respect to the weight function $w(x) = 1$.

If we take each Legendre polynomial and divide through by the leading
coefficient, the set of monic Legendre polynomials, $\{\tilde{P}_0, \tilde{P}_1, \tilde{P}_2, \ldots\}$, results. Since
each $\tilde{P}_j$ is just a constant multiple of $P_j$, the monic Legendre polynomials inherit
all of the linear independence and orthogonality properties of the original Legendre
polynomials.


Optimal Interpolating Points Relative to Euclidean Norm 


Suppose we wish to interpolate the function $f$ over the interval $[-1, 1]$ at the $n+1$
points $x_0, x_1, x_2, \ldots, x_n$, choosing these points so as to minimize the $l_2$ norm of
the polynomial

$$w(x) = (x - x_0)(x - x_1)(x - x_2) \cdots (x - x_n).$$

Consider the quantity $w(x) - \tilde{P}_{n+1}(x)$. Since both polynomials are monic of degree
$n+1$, their difference must be a polynomial of degree at most $n$, call it $q(x)$. Keep
in mind that since $q$ is a polynomial of degree at most $n$, the integral

$$\int_{-1}^{1} \tilde{P}_{n+1}(x) q(x) \, dx$$

is guaranteed to be equal to 0. Now, with $w(x) = \tilde{P}_{n+1}(x) + q(x)$, it follows that

$$\|w\|_2^2 = \|\tilde{P}_{n+1} + q\|_2^2$$

$$= \int_{-1}^{1} \left( \tilde{P}_{n+1}(x) + q(x) \right)^2 dx$$

$$= \|\tilde{P}_{n+1}\|_2^2 + 2 \int_{-1}^{1} \tilde{P}_{n+1}(x) q(x) \, dx + \|q\|_2^2$$

$$= \|\tilde{P}_{n+1}\|_2^2 + \|q\|_2^2.$$

The $l_2$ norm of $w$ will therefore be minimized when $q(x) \equiv 0$; that is, when
$w(x) = \tilde{P}_{n+1}(x)$. Hence, to minimize the $l_2$ norm of $w$ when working on the interval
$[-1, 1]$, the interpolation points must be chosen as the roots of the monic Legendre
polynomial $\tilde{P}_{n+1}(x)$. In Section 6.6 we will prove that these roots are all simple
and all lie inside the interval $[-1, 1]$.

Interpolation over the more general interval $[a, b]$ is handled in precisely the
same manner as noted earlier for the $l_\infty$ norm. The roots of the appropriate monic
Legendre polynomial are scaled by the factor $(b - a)/2$ and translated by $(a + b)/2$.
Interpolation at the resulting points will then minimize the $l_2$ norm of $w$.
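
Unlike the Chebyshev points, the roots of the Legendre polynomials have no simple closed form and are usually computed numerically. The sketch below is one possible approach, not a method prescribed by the text: it evaluates $P_{n+1}$ with the three-term recurrence, obtains the derivative from the standard identity $(x^2 - 1)P_m'(x) = m\left(xP_m(x) - P_{m-1}(x)\right)$, and applies Newton's method from a commonly used cosine starting guess. The function names and the iteration cap are our own assumptions.

```python
import math


def legendre_and_derivative(m, x):
    """Return (P_m(x), P_m'(x)) for |x| < 1 using the three-term recurrence."""
    if m == 0:
        return 1.0, 0.0
    p_prev, p = 1.0, x  # P_0 and P_1
    for k in range(2, m + 1):
        p_prev, p = p, ((2 * k - 1) * x * p - (k - 1) * p_prev) / k
    # Identity: (x^2 - 1) P_m'(x) = m * (x * P_m(x) - P_{m-1}(x))
    dp = m * (x * p - p_prev) / (x * x - 1.0)
    return p, dp


def legendre_points(n, a=-1.0, b=1.0):
    """Roots of P_{n+1}, scaled and translated to [a, b], in increasing order."""
    m = n + 1
    roots = []
    for i in range(1, m + 1):
        x = math.cos(math.pi * (i - 0.25) / (m + 0.5))  # starting guess
        for _ in range(50):  # Newton's method, capped at 50 iterations
            p, dp = legendre_and_derivative(m, x)
            dx = p / dp
            x -= dx
            if abs(dx) < 1e-15:
                break
        roots.append(x)
    roots.sort()
    return [(b - a) / 2.0 * r + (a + b) / 2.0 for r in roots]


print([round(x, 5) for x in legendre_points(4)])
print([round(x, 5) for x in legendre_points(4, -1.0, 3.0)])
```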


EXAMPLE 5.13 The Optimal Interpolating Points - Euclidean Norm


Let's interpolate the function $f(x) = \sin \pi x$ over the interval $[-1, 1]$ with a
polynomial of degree at most 4. The interpolating points that will minimize the $l_2$ norm
of $w$ are the roots of

$$\tilde{P}_5(x) = x^5 - \frac{10}{9}x^3 + \frac{5}{21}x.$$

To five decimal places, the roots of this polynomial are

$$x_0 = -0.90618, \quad x_1 = -0.53847, \quad x_2 = 0, \quad x_3 = 0.53847, \quad x_4 = 0.90618.$$

Note that these points form the third set of interpolating points which were used
in the first example of this section. For comparison, we will also interpolate $f$ at
the uniformly spaced points

$$x_0 = -1, \quad x_1 = -0.5, \quad x_2 = 0, \quad x_3 = 0.5, \quad x_4 = 1,$$

and the Chebyshev points

$$x_0 = \cos\frac{\pi}{10}, \quad x_1 = \cos\frac{3\pi}{10}, \quad x_2 = \cos\frac{5\pi}{10} = 0, \quad x_3 = \cos\frac{7\pi}{10}, \quad x_4 = \cos\frac{9\pi}{10}.$$

Recall that Figure 5.11 displays the absolute value of the interpolation error for the
interpolating polynomial of degree at most four based on each of these three sets of
points. The $l_2$ norms are roughly 0.09391, 0.10876, and 0.13964 for the Legendre
points, the Chebyshev points and the uniformly spaced points, respectively.

Let's also interpolate the function $f(x) = xe^{-x}$ over $[-1, 3]$ using a polynomial
of degree at most four. The properly scaled and translated roots of $\tilde{P}_5$ are

$$x_0 = 1 + 2(-0.90618) = -0.81236, \quad x_1 = 1 + 2(-0.53847) = -0.07694,$$

$$x_2 = 1 + 2(0) = 1, \quad x_3 = 1 + 2(0.53847) = 2.07694,$$

$$x_4 = 1 + 2(0.90618) = 2.81236.$$

The interpolating polynomial generated by these points is

$$p_4(x) = -0.05841x^4 + 0.41820x^3 - 1.07721x^2 + 1.07882x + 0.00648,$$

and $\|f - p_4\|_2 = 0.03916$. It is left as an exercise to verify that interpolation at
uniformly spaced points and at the translated and scaled Chebyshev points both
produce larger errors in the $l_2$ norm.


EXERCISES
1. Prove each of the following properties of the Chebyshev polynomials:
(a) for each $n$, $T_n(1) = 1$.
(b) for each $n$, $T_n(-1) = (-1)^n$.
(c) for all $j > k \ge 0$, $T_j(x) T_k(x) = \frac{1}{2}\left[T_{j+k}(x) + T_{j-k}(x)\right]$.


2. Show that the Chebyshev polynomial $T_n(x)$ is a solution to the differential equation

$$(1 - x^2)\frac{d^2y}{dx^2} - x\frac{dy}{dx} + n^2 y = 0.$$


3. Show that

$$\int_{-1}^{1} \frac{T_n(x) T_m(x)}{\sqrt{1 - x^2}} \, dx = \begin{cases} 0, & m \ne n \\ c_n \pi / 2, & m = n, \end{cases}$$

where $c_0 = 2$ and $c_n = 1$ ($n \ge 1$). This implies that the Chebyshev polynomials
form an orthogonal set on $[-1, 1]$ with respect to the weight function
$w(x) = (1 - x^2)^{-1/2}$. (Hint: Make the substitution $\theta = \cos^{-1} x$.)


4. Show that the Legendre polynomial $P_n(x)$ is a solution to the differential equation

$$(1 - x^2)\frac{d^2y}{dx^2} - 2x\frac{dy}{dx} + n(n+1) y = 0.$$


5. Consider interpolating $f(x) = xe^{-x}$ over $[-1, 3]$ with a polynomial of degree at
most four.

(a) Interpolate at uniformly spaced points and at the scaled and translated
Legendre points. Determine the $l_\infty$ norm of the interpolation error for both
interpolating polynomials and compare with the $l_\infty$ norm associated with
the scaled and translated Chebyshev points.

(b) Interpolate at uniformly spaced points and at the scaled and translated
Chebyshev points. Determine the $l_2$ norm of the interpolation error for
both interpolating polynomials and compare with the $l_2$ norm associated
with the scaled and translated Legendre points.


6. For each of the following intervals, identify the interpolating points that minimize
the $l_\infty$ and the $l_2$ norm of $w$ for linear interpolation.
(a) $[-1, 1]$   (b) $[0, 3.5]$   (c) $[-7, 0]$   (d) $[-\sqrt{2}, 3]$   (e) $[-2.5, 3.5]$

7. Repeat Exercise 6 for cubic interpolation.

8. Repeat Exercise 6 for interpolation by polynomials of degree at most 5.


For Exercises 9-13, interpolate the given function over the specified interval by a
polynomial of the indicated degree. Interpolate at uniformly spaced points, the Chebyshev
points and the Legendre points, and compare the errors in the resulting polynomials in
both the $l_\infty$ and the $l_2$ norm.

9. $f(x) = e^x$, $[-1, 1]$, $n = 3$
10. $f(x) = e^{-x}$, $[-1, 2]$, $n = 3$
11. $f(x) = x \ln x$, $[1, 3]$, $n = 4$
12. $f(x) = \ln(x + 2)$, $[-1, 1]$, $n = 5$
13. $f(x) = 1/x$, $[1, 4]$, $n = 5$


5.5 PIECEWISE LINEAR INTERPOLATION


Consider approximating the function f by an interpolating polynomial using data 
from n+ 1 points, and suppose that the resulting approximation is not accurate 
enough. We could introduce data from more points and use a higher-degree poly- 
nomial; however, using a higher-degree polynomial could introduce undesired oscil- 
lations, as seen in the examples and exercises of Sections 5.1 through 5.3. 

Furthermore, interpolation at equally spaced abscissas can lead to poor results 
in the sense that there are smooth (infinitely differentiable) functions for which the 
interpolation error goes off to infinity as the number of data points increases. As 
an example, consider the function

$$f(x) = \frac{1}{1 + 25x^2}$$

on the interval $[-1, 1]$. If we interpolate this function at the points
$x_i = -1 + 2i/n$ ($i = 0, 1, 2, \ldots, n$), then the maximum norm of the difference between $f$ and
the interpolating polynomial as a function of $n$ is summarized in the following table.


n     Maximum Norm
4     0.428
8     1.045
16    14.394
32    5059.033
64    1.078 × 10^9


Figure 5.12 shows f and the interpolating polynomial with n = 8. Note the large 
oscillations in the interpolating polynomial at the edges of the domain. 
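
The growth of the error with $n$ can be observed with a short experiment. The following sketch interpolates $f(x) = 1/(1 + 25x^2)$ at equally spaced points and estimates the maximum error by sampling; because the maximum is taken over a finite grid and the polynomial is evaluated in Lagrange form, the printed values are approximations and need not match the table digit for digit. The function names and grid size are illustrative assumptions.

```python
def lagrange_eval(xs, ys, x):
    """Evaluate the interpolating polynomial through (xs, ys) at x (Lagrange form)."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total


def runge(x):
    return 1.0 / (1.0 + 25.0 * x * x)


def equispaced_max_error(n, samples=1001):
    """Approximate max |f - p_n| on [-1, 1] for n + 1 equally spaced points."""
    xs = [-1.0 + 2.0 * i / n for i in range(n + 1)]
    ys = [runge(x) for x in xs]
    step = 2.0 / (samples - 1)
    return max(abs(runge(-1.0 + k * step) - lagrange_eval(xs, ys, -1.0 + k * step))
               for k in range(samples))


for n in (4, 8, 16):
    print(n, equispaced_max_error(n))
```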

By using just one polynomial to approximate a function, we also run the risk
that singular behavior at one point could lead to slow convergence over the entire
domain. Consider $f(x) = \sqrt{|x|}$ on $[-1, 1]$, which has singular derivatives to all
orders at $x = 0$. We interpolate this function at the Chebyshev points

$$x_i = \cos\frac{(2i + 1)\pi}{2n + 2}$$

for $i = 0, 1, 2, \ldots, n$. The table below summarizes both the maximum norm and
the Euclidean norm of the interpolation error as a function of $n$.


n      Maximum Norm   Euclidean Norm
4      0.3335         0.2794
8      0.2466         0.1602
16     0.1789         0.08645
32     0.1283         0.04503
64     0.09142        0.02299
128    0.06489        0.01162


Note the maximum norm decreases like $1/\sqrt{n}$, while the Euclidean norm decreases
like $1/n$. Hence, to achieve an interpolation error on the order of $10^{-3}$ measured
in the maximum norm, we would need to select $n \approx 10^6$. For an error on the order
of $10^{-6}$ measured in the Euclidean norm, we would need $n \approx 10^6$.

Figure 5.12 Plot of $f(x) = 1/(1 + 25x^2)$ and the interpolating polynomial,
$P(x)$, generated using data from the equally spaced points
$x_i = -1 + i/4$ for $i = 0, 1, 2, \ldots, 8$.

These phenomena suggest that we should develop an alternative procedure for 
improving interpolation accuracy. In particular, we will introduce more points, but 
use a lower-order polynomial on each of the subintervals defined by the interpolation 
points. This idea gives rise to the notion of piecewise polynomial interpolation. This 
section is devoted to the simplest case of piecewise polynomial interpolation, that 
of piecewise linear interpolation. One type of piecewise cubic interpolation, known 
as cubic spline interpolation, will be developed in the next section. Hermite cubic 
interpolation, a second type of piecewise cubic interpolation that uses both function 
and derivative data, will be treated in Section 5.7. 


The Piecewise Linear Interpolant 
Let $f$ be a function defined on the interval $[a, b]$, and let

$$a = x_0 < x_1 < x_2 < \cdots < x_{n-1} < x_n = b$$

be the $n+1$ distinct points at which $f$ is to be interpolated. Note the collection of
$n$ subintervals into which the $x_i$ divide $[a, b]$ is called a partition of $[a, b]$.


Definition. The PIECEWISE LINEAR INTERPOLANT of $f$ relative to the partition

$$a = x_0 < x_1 < x_2 < \cdots < x_{n-1} < x_n = b$$

is a function $s$ that satisfies
(1) $s$ is continuous on $[a, b]$;
(2) on each subinterval $[x_i, x_{i+1}]$, $i = 0, 1, 2, \ldots, n-1$, $s$ coincides with the
linear polynomial

$$s(x) = s_i(x) = a_i + b_i(x - x_i);$$

(3) $s$ interpolates $f$ at $x_0, x_1, x_2, \ldots, x_n$.


At $x = x_i$, the interpolation condition gives

$$f(x_i) = s(x_i) = s_i(x_i) = a_i.$$

Continuity of the piecewise linear interpolant requires $s_i(x_{i+1}) = s_{i+1}(x_{i+1})$, which
yields the equation

$$a_i + b_i(x_{i+1} - x_i) = a_{i+1},$$

whose solution is

$$b_i = \frac{a_{i+1} - a_i}{x_{i+1} - x_i} = \frac{f(x_{i+1}) - f(x_i)}{x_{i+1} - x_i}.$$

The simplicity of the formulas for the $a_i$ and the $b_i$ makes piecewise linear
interpolation ideal for hand calculations from tabulated data. In fact, it is common
practice to use piecewise linear interpolation from engineering and thermodynamic
tables.


EXAMPLE 5.14 Viscosity of Sulfuric Acid


The following table gives the viscosity of sulfuric acid, in millipascal-seconds
(centipoises), as a function of concentration, in mass percent. From these data, we
would like to estimate the viscosity when the concentration is 5%, 63%, and 85%.
We could attempt to use a fifth-degree interpolating polynomial, but here we will
use a piecewise linear interpolant.


Concentration   0      20     40     60     80     100
Viscosity       0.89   1.40   2.51   5.37   17.4   24.2


Using the formulas derived above for the coefficients of a piecewise linear
interpolant, we find for this data that

$$a_0 = 0.89, \quad a_1 = 1.40, \quad a_2 = 2.51, \quad a_3 = 5.37, \quad a_4 = 17.4$$

and

$$b_0 = \frac{1.40 - 0.89}{20} = 0.0255, \quad b_1 = \frac{2.51 - 1.40}{20} = 0.0555, \quad b_2 = \frac{5.37 - 2.51}{20} = 0.143,$$

$$b_3 = \frac{17.4 - 5.37}{20} = 0.6015, \quad b_4 = \frac{24.2 - 17.4}{20} = 0.34.$$

Hence, the interpolating function relating viscosity to concentration is

$$\text{viscosity} = \begin{cases} 0.89 + 0.0255C, & 0 \le C \le 20 \\ 1.40 + 0.0555(C - 20), & 20 \le C \le 40 \\ 2.51 + 0.143(C - 40), & 40 \le C \le 60 \\ 5.37 + 0.6015(C - 60), & 60 \le C \le 80 \\ 17.4 + 0.34(C - 80), & 80 \le C \le 100 \end{cases}$$

A graph of this function is shown in Figure 5.13. Though the function is continuous,
it clearly is not differentiable at any of the interpolating points. Regardless, the
piecewise linear interpolant provides a reasonable representation for the data.

Figure 5.13 Piecewise linear interpolant for viscosity of sulfuric acid
as a function of concentration. Data points are indicated by asterisks.

As with any piecewise function, evaluation of a piecewise linear interpolant is
a two-step process. We must first determine which polynomial piece needs to be
evaluated, based on the value of the independent variable. Once the polynomial
piece has been selected, we can then evaluate at the given value of the independent
variable. For example, to estimate the viscosity when the concentration is 5%,
the first polynomial piece, $0.89 + 0.0255C$, is selected. We then estimate that the
viscosity is

$$0.89 + 0.0255(5) = 1.0175$$

when the concentration is 5%. For the viscosity when the concentration is 63%, we
find

$$5.37 + 0.6015(63 - 60) = 7.1745,$$

using the fourth linear polynomial. Finally, when the concentration is 85%, we
estimate the viscosity to be

$$17.4 + 0.34(85 - 80) = 19.1.$$
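
The two-step evaluation process translates directly into code. In the sketch below, which is one possible arrangement rather than a prescribed implementation, the piece is located with a binary search from Python's `bisect` module and then evaluated as $a_i + b_i(x - x_i)$; the function name `piecewise_linear` and the decision to raise an error outside the partition are our own choices.

```python
from bisect import bisect_right


def piecewise_linear(xs, ys):
    """Return the piecewise linear interpolant s of the data (xs, ys)."""
    # b_i = (f(x_{i+1}) - f(x_i)) / (x_{i+1} - x_i); the a_i are simply ys[i].
    slopes = [(ys[i + 1] - ys[i]) / (xs[i + 1] - xs[i]) for i in range(len(xs) - 1)]

    def s(x):
        if not xs[0] <= x <= xs[-1]:
            raise ValueError("x lies outside the partition")
        # Step 1: find the subinterval [x_i, x_{i+1}] that contains x.
        i = min(bisect_right(xs, x) - 1, len(xs) - 2)
        # Step 2: evaluate the selected linear piece.
        return ys[i] + slopes[i] * (x - xs[i])

    return s


concentration = [0, 20, 40, 60, 80, 100]
viscosity = [0.89, 1.40, 2.51, 5.37, 17.4, 24.2]
s = piecewise_linear(concentration, viscosity)
for c in (5, 63, 85):
    print(c, s(c))
```

Up to rounding in the last digits, the printed values agree with the three estimates computed by hand in Example 5.14.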


Error in Piecewise Linear Interpolation 
The final issue to consider is the error introduced by piecewise linear interpolation. 


Theorem. Let $f$ be continuous, with two continuous derivatives, on the
interval $[a, b]$, and let $s$ be the piecewise linear interpolant of $f$ relative to the
partition

$$a = x_0 < x_1 < x_2 < \cdots < x_{n-1} < x_n = b.$$

Then

$$\max_{x \in [a,b]} |f(x) - s(x)| \le \frac{1}{8} h^2 \max_{x \in [a,b]} |f''(x)|,$$

where $h = \max_{0 \le i \le n-1} (x_{i+1} - x_i)$.

Proof. The key to establishing this result is recognizing that on each
subinterval, $[x_i, x_{i+1}]$, standard linear interpolation is being performed, so the
standard linear interpolation error formula (see Section 5.1),

$$f(x) - s_i(x) = \frac{1}{2} f''(\xi)(x - x_i)(x - x_{i+1})$$

holds, where $x_i < \xi < x_{i+1}$. Therefore,

$$\max_{x \in [x_i, x_{i+1}]} |f(x) - s_i(x)| \le \frac{1}{2} \max_{x \in [x_i, x_{i+1}]} |f''(x)| \cdot \max_{x \in [x_i, x_{i+1}]} |(x - x_i)(x - x_{i+1})|.$$

Let $g(x) = |(x - x_i)(x - x_{i+1})|$. On $[x_i, x_{i+1}]$, $g$ attains its maximum value of
$h_i^2/4$ when $x = (x_i + x_{i+1})/2$, where $h_i = x_{i+1} - x_i$. Substituting this value
into the above error bound produces

$$\max_{x \in [x_i, x_{i+1}]} |f(x) - s_i(x)| \le \frac{1}{2} \max_{x \in [x_i, x_{i+1}]} |f''(x)| \cdot \frac{1}{4} h_i^2 = \frac{1}{8} h_i^2 \max_{x \in [x_i, x_{i+1}]} |f''(x)|.$$

Since

$$\max_{x \in [a,b]} |f(x) - s(x)| = \max_{0 \le i \le n-1} \left( \max_{x \in [x_i, x_{i+1}]} |f(x) - s_i(x)| \right),$$

it follows that

$$\max_{x \in [a,b]} |f(x) - s(x)| \le \frac{1}{8} \max_{0 \le i \le n-1} h_i^2 \cdot \max_{0 \le i \le n-1} \left( \max_{x \in [x_i, x_{i+1}]} |f''(x)| \right) = \frac{1}{8} h^2 \max_{x \in [a,b]} |f''(x)|,$$

where $h = \max_{0 \le i \le n-1} h_i$. □


EXAMPLE 5.15 Constructing a Table of Sine and Cosine Values


A textbook publisher needs a table of values for the sine and cosine functions.
Each function is to be tabulated at equal increments running from $\theta = 0°$ through
$\theta = 45°$. The values in the table will be given to six decimal places, and the
increment in $\theta$ must be selected small enough so that linear interpolation between
consecutive values will introduce an error less than $10^{-6}$.

To accomplish this objective, we must select $h$ so that

$$\frac{h^2}{8} \max_{0 \le \theta \le \pi/4} |f''(\theta)| < 10^{-6},$$

where the argument $\theta$ has been given in radians and $f$ can be taken as either the
sine or the cosine function. In either case,

$$\max_{0 \le \theta \le \pi/4} |f''(\theta)| \le 1,$$

leading to $h < 2.828 \times 10^{-3}$ radians. Converting to degrees, any increment less than
0.162 degrees will suffice. For convenience, we might therefore suggest an increment
of one-tenth of a degree.
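
The arithmetic in this example amounts to solving $h^2 M / 8 = \text{tolerance}$ for $h$, where $M$ bounds $|f''|$. The short sketch below carries out that calculation and also evaluates the error bound for the suggested increment of one-tenth of a degree; the function names are illustrative only.

```python
import math


def max_increment(tolerance, second_derivative_bound):
    """Largest h satisfying (h^2 / 8) * M = tolerance."""
    return math.sqrt(8.0 * tolerance / second_derivative_bound)


def error_bound(h, second_derivative_bound):
    """Piecewise linear interpolation error bound (h^2 / 8) * M."""
    return h * h * second_derivative_bound / 8.0


h_rad = max_increment(1e-6, 1.0)
h_deg = math.degrees(h_rad)
print(h_rad, h_deg)

# Bound on the error when the table uses an increment of 0.1 degree.
print(error_bound(math.radians(0.1), 1.0))
```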


EXERCISES 


Exercises 1 through 6 are based on the following data for the density, ρ, viscosity, μ,
kinematic viscosity, ν, surface tension, Y, vapor pressure, p_v, and sound speed, a, of
water as a function of temperature. These data were drawn from Tables A.1 and A.5
in Frank White, Fluid Mechanics:


T      ρ          μ                    ν                 Y        p_v     a
(°C)   (kg/m^3)   (×10^-3 N·s/m^2)     (×10^-6 m^2/s)    (N/m)    (kPa)   (m/s)
0      1000       1.788                1.788             0.0756   0.611   1402
10     1000       1.307                1.307             0.0742   1.227   1447
20     998        1.003                1.005             0.0728   2.337   1482
30     996        0.799                0.802             0.0712   4.242   1509
40     992        0.657                0.662             0.0696   7.375   1529
50     988        0.548                0.555             0.0679   12.34   1542
60     983        0.467                0.475             0.0662   19.92   1551
70     978        0.405                0.414             0.0644   31.16   1553
80     972        0.355                0.365             0.0626   47.35   1554
90     965        0.316                0.327             0.0608   70.11   1550
100    958        0.283                0.295             0.0589   101.3   1543


1. Estimate, using piecewise linear interpolation, the density of water when T =
34°C, 68°C, 86°C, and 91°C.

2. Estimate, using piecewise linear interpolation, the viscosity of water when T =
34°C, 68°C, 86°C, and 91°C. At what temperature is the viscosity 1.000 × 10^-3
N·s/m^2?

3. Estimate, using piecewise linear interpolation, the kinematic viscosity of water
when T = 34°C, 68°C, 86°C, and 91°C. At what temperature is the kinematic
viscosity 1.000 × 10^-6 m^2/s?

4. Estimate, using piecewise linear interpolation, the surface tension of water when
T = 34°C, 68°C, 86°C, and 91°C. At what temperature is the surface tension
0.0650 N/m?

5. Estimate, using piecewise linear interpolation, the vapor pressure of water when
T = 34°C, 68°C, 86°C, and 91°C.

6. Estimate, using piecewise linear interpolation, the sound speed of water when
T = 34°C, 68°C, 86°C, and 91°C.


For Exercises 7 through 9, use the values given below for the temperature, $T$, pressure,
$p$, and density, $\rho$, of the standard atmosphere as a function of altitude. These data were
drawn from Table A.6 in Frank White, Fluid Mechanics:


z (m)               0         500       1000      1500      2000      2500      3000
T (K)               288.16    284.91    281.66    278.41    275.16    271.91    268.66
p (Pa)              101,350   95,480    89,889    84,565    79,500    74,684    70,107
$\rho$ (kg/m$^3$)   1.2255    1.1677    1.1120    1.0583    1.0067    0.9570    0.9092


7. Estimate, using piecewise linear interpolation, the temperature of the standard 
atmosphere at an altitude of z = 800 m, 1600 m, 2350 m, and 2790 m. At what 
altitude is the temperature of the standard atmosphere 273.1 K? 


8. Estimate, using piecewise linear interpolation, the pressure of the standard at- 
mosphere at an altitude of z = 800 m, 1600 m, 2350 m, and 2790 m. 


9. Estimate, using piecewise linear interpolation, the density of the standard at- 
mosphere at an altitude of z = 800 m, 1600 m, 2350 m, and 2790 m. At what 
altitude is the density of the standard atmosphere 1.1000 kg/m$^3$?


10. A publisher of mathematics textbooks needs a table of values for the common
logarithm function (i.e., the base 10 logarithm) for one of its new precalculus
textbooks. Each entry in the table is to be accurate to six (6) decimal places.
The table must include entries for uniformly spaced values of $x$ ranging from
1.0 to 10.0, and the increment between $x$-values must be small enough so that
linear interpolation between any two consecutive entries in the table introduces
an error of less than $10^{-6}$. What is the maximum possible increment that can
be used in the construction of this table? What increment would you suggest for
the construction of the table?

11. Suppose that a table lists the values of the tangent function for angles ranging 
from 0.0° to 45.0° in increments of 0.5°. What is the largest error that we would 
introduce by performing linear interpolation between successive values in this 
table? 

12. Suppose that a table lists the values of the inverse sine function for inputs ranging 
from 0.000 to 0.950 in increments of 0.001. What is the largest error that we 
would introduce by performing linear interpolation between successive values in 
this table?


5.6 CUBIC SPLINE INTERPOLATION


Although simple to implement and ideal for hand calculations, a major disadvantage 
associated with piecewise linear interpolation is that the interpolating function 
generally will not be differentiable at the interpolating points. In many instances, 
however, physical considerations will require that the interpolating function be
continuously differentiable. 

To achieve more smoothness (and greater accuracy) from the interpolating 
function, higher-degree polynomial pieces must be used. The most common choice 
is cubic polynomials. These cubic polynomial pieces can be combined in different 
ways to produce the overall interpolating function. Here, we shall develop the tech- 
nique of cubic spline interpolation, which obtains the highest degree of smoothness 
from the piecewise interpolating function. In the next section, we will consider a 
different approach to piecewise cubic interpolation which utilizes both function and 
derivative data. 


The Cubic Spline Interpolant 
Let $f$ be a function defined on the interval $[a, b]$, and let

$$a = x_0 < x_1 < x_2 < \cdots < x_{n-1} < x_n = b$$

be the $n + 1$ distinct points at which $f$ is to be interpolated. Recall that the $x_j$
divide $[a, b]$ into $n$ subintervals, referred to as a partition of $[a, b]$.


Definition. A CUBIC SPLINE INTERPOLANT of $f$ relative to the partition

$$a = x_0 < x_1 < x_2 < \cdots < x_{n-1} < x_n = b$$

is a function $s$ that satisfies

(1) on each subinterval $[x_j, x_{j+1}]$, $j = 0, 1, 2, \ldots, n-1$, $s$ coincides with
the cubic polynomial
$$s(x) = s_j(x) = a_j + b_j(x - x_j) + c_j(x - x_j)^2 + d_j(x - x_j)^3;$$
(2) $s$ interpolates $f$ at $x_0, x_1, x_2, \ldots, x_n$;
(3) $s$ is continuous on $[a, b]$;
(4) $s'$ is continuous on $[a, b]$;
(5) $s''$ is continuous on $[a, b]$.


Though this definition clarifies the important characteristics of a cubic spline,
it does not provide enough information to completely determine the interpolating
function. The function $s$ is composed of $n$ different cubic polynomials, each with
four coefficients, so there are a total of $4n$ unknowns. Interpolation provides $n + 1$
equations. Continuity of the spline and its first two derivatives contributes an additional
$3(n-1) = 3n-3$ equations; remember that continuity applies at the interior
points $x_1, x_2, x_3, \ldots, x_{n-1}$ only. The definition of a cubic spline therefore provides
$n + 1 + 3(n-1) = 4n - 2$ equations. To completely determine the interpolating
function, two more equations will have to be specified.

Below, we will discuss two different types of additional constraints, or boundary
conditions: the not-a-knot boundary conditions and the clamped (or complete)
boundary conditions. A third type of boundary condition, the natural (or free)
boundary conditions, will be explored in the exercises. For a review of other possible
boundary conditions, see Ueberhuber [1].

Fortunately, even after two more equations have been specified, the full system 
of 4n equations in 4n unknowns can be solved very efficiently. To see how this is 
done, let’s start by writing out the equations which follow from the definition. 


Interpolation:

$$s_j(x_j) = a_j = f(x_j), \qquad j = 0, 1, 2, \ldots, n$$

Continuity of spline:

$$a_{j+1} = a_j + b_j h_j + c_j h_j^2 + d_j h_j^3, \qquad j = 0, 1, 2, \ldots, n-2$$

Continuity of spline derivative:

$$b_{j+1} = b_j + 2c_j h_j + 3d_j h_j^2, \qquad j = 0, 1, 2, \ldots, n-2$$

Continuity of spline second derivative:

$$c_{j+1} = c_j + 3d_j h_j, \qquad j = 0, 1, 2, \ldots, n-2$$


To simplify the equations, we have defined $h_j = x_{j+1} - x_j$. Note that we are using
$a_n = f(x_n) = f(b)$, which is a slight extension to the notation introduced in the
definition of a cubic spline interpolant. This extension will make it easier to express
the equations we are about to develop.

The interpolation conditions directly provide the values for the $a_j$, thereby
removing one-quarter of the unknowns. Next, solve the equation for the continuity
of the spline second derivative for $d_j$:

$$d_j = \frac{c_{j+1} - c_j}{3h_j}. \qquad (1)$$

Substituting this expression into the equations for the continuity of the spline and
its first derivative gives

$$a_{j+1} = a_j + b_j h_j + c_j h_j^2 + \frac{c_{j+1} - c_j}{3} h_j^2
          = a_j + b_j h_j + \frac{c_{j+1} + 2c_j}{3} h_j^2 \qquad (2)$$

and

$$b_{j+1} = b_j + 2c_j h_j + (c_{j+1} - c_j) h_j
          = b_j + (c_{j+1} + c_j) h_j. \qquad (3)$$

Finally, solve equation (2) for $b_j$:

$$b_j = \frac{a_{j+1} - a_j}{h_j} - \frac{2c_j + c_{j+1}}{3} h_j \qquad (4)$$


and substitute the result into equation (3). After performing some algebraic
manipulation and shifting the subscripts down by one, we arrive at

$$h_{j-1} c_{j-1} + 2(h_{j-1} + h_j) c_j + h_j c_{j+1} = \frac{3}{h_j}(a_{j+1} - a_j) - \frac{3}{h_{j-1}}(a_j - a_{j-1}). \qquad (5)$$

This equation holds for $j = 1, 2, 3, \ldots, n-1$ and forms the basis for a tridiagonal
system of equations for determining the $c_j$. The equations for $j = 0$ and $j = n$
depend on the type of boundary conditions which are being applied.
Regardless of the choice of boundary conditions, computing the coefficients of
a cubic spline interpolant is a two-step process. First, the linear system for the $c_j$
must be solved. Recall that an efficient algorithm for solving tridiagonal systems
was developed in Chapter 3 and is listed in Appendix B for convenience. Once
this has been done, equation (1) is used to compute the $d_j$ and equation (4) is
used to compute the $b_j$. Remember that the $a_j$ are given by the function values,
$f(x_j)$. Evaluation of a cubic spline interpolant, as with any piecewise function,
is also a two-step process. Based on the value of the independent variable, the
polynomial piece which needs to be evaluated must first be determined. Once the
polynomial piece has been selected, the value of the interpolant at the given value
of the independent variable can then be calculated.
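
The two-step evaluation just described can be sketched in a few lines of Python. This is only an illustrative sketch, not code from the text: the function name `eval_piecewise_cubic` and the storage of the coefficients in plain lists are assumptions made here. The sample coefficients are the rounded values reported for the clamped spline of Example 5.18 later in this section.

```python
from bisect import bisect_right


def eval_piecewise_cubic(x_nodes, a, b, c, d, x):
    """Evaluate s(x) = a_j + b_j*t + c_j*t**2 + d_j*t**3 with t = x - x_j.

    x_nodes holds x_0 < x_1 < ... < x_n; a, b, c, d hold the n sets of
    coefficients (hypothetical list-based storage chosen for this sketch).
    """
    n = len(x_nodes) - 1                       # number of polynomial pieces
    if not x_nodes[0] <= x <= x_nodes[-1]:
        raise ValueError("x lies outside [a, b]")
    # Step 1: find the piece j with x_j <= x < x_{j+1};
    # the right endpoint x = b is assigned to the last piece.
    j = min(bisect_right(x_nodes, x) - 1, n - 1)
    # Step 2: evaluate the selected cubic in nested (Horner) form.
    t = x - x_nodes[j]
    return a[j] + t * (b[j] + t * (c[j] + t * d[j]))


# Coefficients copied from the table in Example 5.18.
nodes = [-1.0, -0.5, 0.0, 0.5, 1.0]
a_ex = [0.00000, 0.82436, 1.00000, 0.90980]
b_ex = [2.71828000000, 0.82067285714, -0.00097142857, -0.30414714286]
c_ex = [-2.62214571429, -1.17306857143, -0.47022000000, -0.13613142857]
d_ex = [0.96605142857, 0.46856571429, 0.22272571429, 0.09653142857]

value_at_right_end = eval_piecewise_cubic(nodes, a_ex, b_ex, c_ex, d_ex, 1.0)
```

A binary search is used for the first step because the interpolating points are sorted; for a handful of points a linear scan would serve equally well.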


Not-a-Knot Boundary Conditions 


When no information other than the value of $f$ at each interpolating point is available,
it is recommended that the not-a-knot boundary conditions be applied. These
conditions require that $s'''$ be continuous at $x = x_1$ and $x = x_{n-1}$. In terms of the
spline coefficients, this translates to

$$d_0 = d_1 \quad \text{and} \quad d_{n-2} = d_{n-1}.$$


Using (1), and rearranging terms, these equations can be expressed in terms of the
$c_j$ as

$$h_1 c_0 - (h_0 + h_1) c_1 + h_0 c_2 = 0 \qquad (6)$$
$$h_{n-1} c_{n-2} - (h_{n-2} + h_{n-1}) c_{n-1} + h_{n-2} c_n = 0. \qquad (7)$$


Unfortunately, (6) and (7) do not preserve the tridiagonal structure of (5).
This situation, however, can be remedied as follows. Solve equation (6) for $c_0$ and
equation (7) for $c_n$. This gives

$$c_0 = \left(1 + \frac{h_0}{h_1}\right) c_1 - \frac{h_0}{h_1} c_2 \qquad (8)$$

$$c_n = -\frac{h_{n-1}}{h_{n-2}} c_{n-2} + \left(1 + \frac{h_{n-1}}{h_{n-2}}\right) c_{n-1}. \qquad (9)$$

Now, substitute $c_0$ from (8) into (5), for $j = 1$, and group terms to obtain

$$\left(3h_0 + 2h_1 + \frac{h_0^2}{h_1}\right) c_1 + \left(h_1 - \frac{h_0^2}{h_1}\right) c_2
  = \frac{3}{h_1}(a_2 - a_1) - \frac{3}{h_0}(a_1 - a_0). \qquad (10)$$

Proceed in a similar manner with the expression for $c_n$ from (9) substituted into
(5) for $j = n - 1$ to produce

$$\left(h_{n-2} - \frac{h_{n-1}^2}{h_{n-2}}\right) c_{n-2} + \left(3h_{n-1} + 2h_{n-2} + \frac{h_{n-1}^2}{h_{n-2}}\right) c_{n-1}
  = \frac{3}{h_{n-1}}(a_n - a_{n-1}) - \frac{3}{h_{n-2}}(a_{n-1} - a_{n-2}). \qquad (11)$$

Equation (5), for $j = 2, 3, 4, \ldots, n-2$, together with equations (10) and (11),
constitutes a complete tridiagonal system for the coefficients $c_1, c_2, c_3, \ldots, c_{n-1}$. Since
each $h_j$ is positive by definition, the coefficient matrix for this linear system is
strictly diagonally dominant. Hence, there is always a unique solution for the $c_j$.
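
As a rough illustration of how equations (5), (10), and (11) fit together, the following Python sketch assembles the tridiagonal system for $c_1, \ldots, c_{n-1}$, solves it, and then recovers the remaining coefficients from equations (8), (9), (1), and (4). The function names `solve_tridiagonal` and `not_a_knot_spline` are hypothetical, and the simple elimination routine merely stands in for the tridiagonal solver of Chapter 3; it is not that algorithm's listing.

```python
def solve_tridiagonal(lower, diag, upper, rhs):
    """Solve a tridiagonal system by elimination without pivoting.

    lower[i], diag[i], upper[i] are the entries of row i; lower[0] and
    upper[-1] are never used.
    """
    m = len(diag)
    diag, rhs = list(diag), list(rhs)          # work on copies
    for i in range(1, m):                      # forward elimination
        w = lower[i] / diag[i - 1]
        diag[i] -= w * upper[i - 1]
        rhs[i] -= w * rhs[i - 1]
    sol = [0.0] * m
    sol[-1] = rhs[-1] / diag[-1]
    for i in range(m - 2, -1, -1):             # back substitution
        sol[i] = (rhs[i] - upper[i] * sol[i + 1]) / diag[i]
    return sol


def not_a_knot_spline(x, y):
    """Return lists (a, b, c, d), one entry per cubic piece."""
    n = len(x) - 1
    if n < 3:
        raise ValueError("this sketch assumes at least four data points")
    h = [x[j + 1] - x[j] for j in range(n)]
    # Rows for j = 1, ..., n-1 from equation (5); row k corresponds to j = k + 1.
    lower = [h[j - 1] for j in range(1, n)]
    diag = [2.0 * (h[j - 1] + h[j]) for j in range(1, n)]
    upper = [h[j] for j in range(1, n)]
    rhs = [3.0 * (y[j + 1] - y[j]) / h[j] - 3.0 * (y[j] - y[j - 1]) / h[j - 1]
           for j in range(1, n)]
    # The first and last rows are replaced by equations (10) and (11).
    diag[0] = 3.0 * h[0] + 2.0 * h[1] + h[0] ** 2 / h[1]
    upper[0] = h[1] - h[0] ** 2 / h[1]
    lower[-1] = h[n - 2] - h[n - 1] ** 2 / h[n - 2]
    diag[-1] = 3.0 * h[n - 1] + 2.0 * h[n - 2] + h[n - 1] ** 2 / h[n - 2]
    c = [0.0] + solve_tridiagonal(lower, diag, upper, rhs) + [0.0]
    # Equations (8) and (9) recover c_0 and c_n.
    c[0] = (1.0 + h[0] / h[1]) * c[1] - (h[0] / h[1]) * c[2]
    c[n] = (1.0 + h[n - 1] / h[n - 2]) * c[n - 1] - (h[n - 1] / h[n - 2]) * c[n - 2]
    # Equation (1) gives d_j and equation (4) gives b_j.
    d = [(c[j + 1] - c[j]) / (3.0 * h[j]) for j in range(n)]
    b = [(y[j + 1] - y[j]) / h[j] - (2.0 * c[j] + c[j + 1]) * h[j] / 3.0
         for j in range(n)]
    return list(y[:n]), b, c[:n], d
```

Elimination without pivoting is acceptable in this sketch because the coefficient matrix is strictly diagonally dominant, as noted above.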


EXAMPLE 5.16 Emittance of Tungsten as a Function of Temperature 


The table below gives experimental values for the emittance of tungsten as a func- 
tion of temperature. 


Temperature (K)   Emittance
300               0.024
400               0.035
500               0.046
600               0.058
700               0.067
800               0.083
900               0.097
1000              0.111
1100              0.125


This problem was previously considered in Section 5.1, where it was shown that
the behavior of the eighth-degree interpolating polynomial was not consistent with
the data, in particular at the ends of the domain. Here, we develop the not-a-knot
cubic spline interpolant (the cubic spline with not-a-knot boundary conditions) for
the data. With $h_j = 100$ for each $j$, the system of equations for the $c_j$ ($1 \le j \le 7$)
takes the form

$$\begin{bmatrix}
600 & 0 & & & & & \\
100 & 400 & 100 & & & & \\
 & 100 & 400 & 100 & & & \\
 & & 100 & 400 & 100 & & \\
 & & & 100 & 400 & 100 & \\
 & & & & 100 & 400 & 100 \\
 & & & & & 0 & 600
\end{bmatrix}
\begin{bmatrix} c_1 \\ c_2 \\ c_3 \\ c_4 \\ c_5 \\ c_6 \\ c_7 \end{bmatrix}
= 0.03
\begin{bmatrix} 0 \\ 0.001 \\ -0.003 \\ 0.007 \\ -0.002 \\ 0 \\ 0 \end{bmatrix}$$


Solving this linear system and then applying equations (8) and (9) to compute $c_0$
and $c_8$, followed by the application of equations (1) and (4), we obtain the complete
set of spline coefficients:

a_j     b_j             c_j              d_j
0.024   0.00012256410   -0.00000018846   0.00000000063
0.035   0.00010371795   0                0.00000000063
0.046   0.00012256410   0.00000018846    -0.00000000214
0.058   0.00009602564   -0.00000045385   0.00000000394
0.067   0.00012333333   0.00000072692    -0.00000000360
0.083   0.00016064103   -0.00000035385   0.00000000147
0.097   0.00013410256   0.00000008846    -0.00000000029
0.111   0.00014294872   0                -0.00000000029


Figure 5.14 Comparison of not-a-knot cubic spline and eighth-degree
interpolating polynomial for emittance of tungsten as a function of temperature
(horizontal axis: temperature, K). Data values are indicated by asterisks.


The graph of the not-a-knot cubic spline is given in Figure 5.14. The graph of the 
eighth-degree interpolating polynomial is also shown for comparison. Clearly, the 
behavior of the not-a-knot cubic spline is more consistent with the underlying data. 


EXAMPLE 5.17 Mean Activity Coefficient of Silver Nitrate 


The mean activity coefficient at 25°C for silver nitrate, as a function of molality, 
is given in the table below. Estimate the mean activity coefficient for a molality of 
0.032 and for a molality of 1.682. 


Molality      0.005   0.010   0.020   0.050   0.100   0.200   0.500   1.000   2.000
Coefficient   0.924   0.896   0.859   0.794   0.732   0.656   0.536   0.430   0.316


Figure 5.15 Not-a-knot cubic spline for mean activity coefficient of
silver nitrate as a function of molality. Data points are indicated by
asterisks.


Using an eighth-degree interpolating polynomial, the estimates for the mean activity
coefficient are 0.831 for a molality of 0.032 and −216711.827 for a molality of 1.682.
The former value is reasonable, while the latter clearly is not.

The not-a-knot cubic spline computed from this data is shown in Figure 5.15. 
Evaluating the spline at a molality of 0.032 gives an estimate for the mean activity 
coefficient of 0.828. For a molality of 1.682, the estimate for the mean activity 
coefficient is 0.349. 


Clamped Boundary Conditions 


If the values $f'(a)$ and $f'(b)$ are known, then it is better to apply the clamped (or
complete) boundary conditions $s'(a) = f'(a)$ and $s'(b) = f'(b)$. Starting with $x = a$,
we find $f'(a) = s'(a) = s_0'(a) = b_0$. Equation (4) with $j = 0$ allows us to write this
condition in terms of the $c_j$:

$$\frac{a_1 - a_0}{h_0} - \frac{2c_0 + c_1}{3} h_0 = f'(a)$$

or

$$2h_0 c_0 + h_0 c_1 = \frac{3}{h_0}(a_1 - a_0) - 3f'(a). \qquad (12)$$


At $x = b$, $f'(b) = s'(b) = s_n'(b) = b_n$. Using equation (3) to express $b_n$ in terms of
$b_{n-1}$, $c_{n-1}$ and $c_n$, followed by equation (4) to rewrite $b_{n-1}$ in terms of $a_{n-1}$, $a_n$,
$c_{n-1}$ and $c_n$, we obtain the equation

$$h_{n-1} c_{n-1} + 2h_{n-1} c_n = 3f'(b) - \frac{3}{h_{n-1}}(a_n - a_{n-1}). \qquad (13)$$

Combining equation (5) for $j = 1, 2, 3, \ldots, n-1$ with equations (12) and (13)
produces a complete tridiagonal linear system for determining the $c_j$. As with
not-a-knot boundary conditions, the coefficient matrix associated with clamped
boundary conditions is strictly diagonally dominant; hence, there is again always a
unique solution for the $c_j$.
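
The clamped system can be assembled in the same spirit. The Python sketch below uses equation (12) as the first row, equation (5) for the interior rows, and equation (13) as the last row, and then applies equations (1) and (4). The names `solve_tridiagonal`, `clamped_spline`, `fpa`, and `fpb` are assumptions introduced for this illustration, and the elimination routine is a generic stand-in for the tridiagonal solver of Chapter 3.

```python
def solve_tridiagonal(lower, diag, upper, rhs):
    """Solve a tridiagonal system by elimination without pivoting.

    lower[0] and upper[-1] are never used.
    """
    m = len(diag)
    diag, rhs = list(diag), list(rhs)          # work on copies
    for i in range(1, m):                      # forward elimination
        w = lower[i] / diag[i - 1]
        diag[i] -= w * upper[i - 1]
        rhs[i] -= w * rhs[i - 1]
    sol = [0.0] * m
    sol[-1] = rhs[-1] / diag[-1]
    for i in range(m - 2, -1, -1):             # back substitution
        sol[i] = (rhs[i] - upper[i] * sol[i + 1]) / diag[i]
    return sol


def clamped_spline(x, y, fpa, fpb):
    """Return (a, b, c, d) for the clamped spline; fpa = f'(a), fpb = f'(b)."""
    n = len(x) - 1
    h = [x[j + 1] - x[j] for j in range(n)]
    lower = [0.0] * (n + 1)
    diag = [0.0] * (n + 1)
    upper = [0.0] * (n + 1)
    rhs = [0.0] * (n + 1)
    # Row 0: equation (12).
    diag[0] = 2.0 * h[0]
    upper[0] = h[0]
    rhs[0] = 3.0 * (y[1] - y[0]) / h[0] - 3.0 * fpa
    # Rows 1, ..., n-1: equation (5).
    for j in range(1, n):
        lower[j] = h[j - 1]
        diag[j] = 2.0 * (h[j - 1] + h[j])
        upper[j] = h[j]
        rhs[j] = (3.0 * (y[j + 1] - y[j]) / h[j]
                  - 3.0 * (y[j] - y[j - 1]) / h[j - 1])
    # Row n: equation (13).
    lower[n] = h[n - 1]
    diag[n] = 2.0 * h[n - 1]
    rhs[n] = 3.0 * fpb - 3.0 * (y[n] - y[n - 1]) / h[n - 1]
    c = solve_tridiagonal(lower, diag, upper, rhs)
    # Equation (1) gives d_j and equation (4) gives b_j.
    d = [(c[j + 1] - c[j]) / (3.0 * h[j]) for j in range(n)]
    b = [(y[j + 1] - y[j]) / h[j] - (2.0 * c[j] + c[j + 1]) * h[j] / 3.0
         for j in range(n)]
    return list(y[:n]), b, c[:n], d
```

Note that, unlike the not-a-knot case, all $n + 1$ values $c_0, \ldots, c_n$ are unknowns of the linear system here, so no separate recovery step for $c_0$ and $c_n$ is needed.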


EXAMPLE 5.18 A Clamped Cubic Spline 


Let’s determine the clamped cubic spline for the following data: 


x      f(x)      f'(x)
−1.0   0.00000   2.71828
−0.5   0.82436
0.0    1.00000
0.5    0.90980
1.0    0.73576   −0.36788


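Note that this data is taken from the function $f(x) = (x + 1)e^{-x}$. Since $h_j = 0.5$
for each $j$, the tridiagonal system for the $c_j$ takes the form

$$\begin{bmatrix}
1 & 0.5 & & & \\
0.5 & 2 & 0.5 & & \\
 & 0.5 & 2 & 0.5 & \\
 & & 0.5 & 2 & 0.5 \\
 & & & 0.5 & 1
\end{bmatrix}
\begin{bmatrix} c_0 \\ c_1 \\ c_2 \\ c_3 \\ c_4 \end{bmatrix}
=
\begin{bmatrix} -3.20868 \\ -3.89232 \\ -1.59504 \\ -0.50304 \\ -0.05940 \end{bmatrix}$$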


Solving this system and then applying equation (1) to compute the $d_j$ and equation (4)
to compute the $b_j$, we obtain the complete set of spline coefficients:


a_j       b_j              c_j              d_j
0.00000   2.71828000000    -2.62214571429   0.96605142857
0.82436   0.82067285714    -1.17306857143   0.46856571429
1.00000   -0.00097142857   -0.47022000000   0.22272571429
0.90980   -0.30414714286   -0.13613142857   0.09653142857


The graph of the clamped cubic spline is shown at the top of Figure 5.16. To the
resolution of the plotting device, the graph of the cubic spline is indistinguishable
from the graph of $f$. In fact, $\|f - s\|_\infty \approx 0.0015$.

The error in the clamped cubic spline, as a function of $x$, is shown in the
bottom graph of Figure 5.16. For comparison, the error in the not-a-knot cubic
spline is also displayed. Clearly, when the derivative values at the ends of the
domain are available, the clamped cubic spline is superior to the not-a-knot cubic
spline.

Figure 5.16 (Top graph) Clamped cubic spline for data taken from
$f(x) = (x + 1)e^{-x}$. Data points are indicated by asterisks. (Bottom
graph) Comparison of error in clamped cubic spline and not-a-knot cubic
spline.


The clamped cubic spline satisfies an interesting property related to the curvature
of a function. For a general function, the curvature at a point is defined
by

$$\kappa(x) = \frac{|f''(x)|}{\left(1 + [f'(x)]^2\right)^{3/2}},$$

which is commonly linearized to $\kappa(x) \approx |f''(x)|$. The quantity $\int_a^b [f''(x)]^2\,dx$ can
therefore be viewed as a crude measure of the total curvature over an interval.
We will now prove that, in this measure, any smooth interpolating function which
satisfies clamped boundary conditions must have a total curvature at least as large
as that of the clamped cubic spline. This is sometimes referred to as the minimum
curvature property of the clamped cubic spline.

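
For a piecewise cubic, this curvature measure can be computed in closed form, since on piece $j$ the second derivative is $s''(x) = 2c_j + 6d_j(x - x_j)$. The short Python sketch below sums the exact integral of $[s''(x)]^2$ over the pieces. The function name `total_curvature_measure` and its argument layout are assumptions made for this illustration only.

```python
def total_curvature_measure(h, c, d):
    """Return the integral of [s''(x)]**2 over [a, b] for a piecewise cubic.

    h[j] is the length of subinterval j; c[j] and d[j] are the quadratic and
    cubic coefficients of piece j, so that s''(t) = 2*c[j] + 6*d[j]*t for
    0 <= t <= h[j].
    """
    total = 0.0
    for hj, cj, dj in zip(h, c, d):
        # Exact integral of (2*cj + 6*dj*t)**2 from t = 0 to t = hj.
        total += 4.0 * cj ** 2 * hj + 12.0 * cj * dj * hj ** 2 + 12.0 * dj ** 2 * hj ** 3
    return total
```

Comparing this value for two different interpolants of the same data gives a quick numerical check of the minimum curvature property.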


Theorem. Let $g$ be any function, continuous and twice continuously differentiable
on the interval $[a, b]$, that interpolates $f$ over the partition

$$a = x_0 < x_1 < x_2 < \cdots < x_{n-1} < x_n = b$$

and satisfies the clamped boundary conditions $g'(a) = f'(a)$ and $g'(b) = f'(b)$.
Then

$$\int_a^b [s''(x)]^2\,dx \le \int_a^b [g''(x)]^2\,dx,$$

where $s$ is the clamped cubic spline.

Proof. First, observe that

$$\int_a^b [g''(x)]^2\,dx = \int_a^b [g''(x) - s''(x) + s''(x)]^2\,dx
  = \int_a^b [g''(x) - s''(x)]^2\,dx + 2\int_a^b s''(x)[g''(x) - s''(x)]\,dx + \int_a^b [s''(x)]^2\,dx.$$

Next, focus on the term

$$\int_a^b s''(x)[g''(x) - s''(x)]\,dx = \sum_{i=0}^{n-1} \int_{x_i}^{x_{i+1}} s''(x)[g''(x) - s''(x)]\,dx.$$

After integrating by parts twice, it follows that

$$\int_{x_i}^{x_{i+1}} s''(x)[g''(x) - s''(x)]\,dx
  = \Big\{ s''(x)[g'(x) - s'(x)] - s'''(x)[g(x) - s(x)] \Big\}\Big|_{x_i}^{x_{i+1}}
  + \int_{x_i}^{x_{i+1}} s^{(4)}(x)[g(x) - s(x)]\,dx.$$

Since $s$ is a cubic polynomial on $[x_i, x_{i+1}]$, $s^{(4)}(x) = 0$. Furthermore, since
both $s$ and $g$ interpolate $f$ at each $x_i$,

$$\Big\{ s'''(x)[g(x) - s(x)] \Big\}\Big|_{x_i}^{x_{i+1}} = 0.$$

Therefore,

$$\int_a^b s''(x)[g''(x) - s''(x)]\,dx
  = \sum_{i=0}^{n-1} \Big\{ s''(x)[g'(x) - s'(x)] \Big\}\Big|_{x_i}^{x_{i+1}}
  = s''(x)[g'(x) - s'(x)]\Big|_{x=b} - s''(x)[g'(x) - s'(x)]\Big|_{x=a}
  = 0,$$

due to the clamped boundary conditions satisfied by both $s$ and $g$. Thus

$$\int_a^b [g''(x)]^2\,dx = \int_a^b [g''(x) - s''(x)]^2\,dx + \int_a^b [s''(x)]^2\,dx \ge \int_a^b [s''(x)]^2\,dx$$

since the integral of a nonnegative function is always nonnegative. □


Error in Cubic Spline Interpolation 


We conclude the discussion of cubic spline interpolation with a theorem on the error
associated with the clamped cubic spline. For a proof of this result, see de Boor [2],
Hall and Meyer [3], or Schultz [4]. An error bound for the not-a-knot cubic spline,
also of fourth order, can be found in de Boor [5] or Beatson [6].


402. Chapter 5 Interpolation (and Curve Fitting) 


Theorem. Let $f$ be continuous, with four continuous derivatives, on the
interval $[a, b]$, and let $s$ be the clamped cubic spline interpolant of $f$ relative
to the partition

$$a = x_0 < x_1 < x_2 < \cdots < x_{n-1} < x_n = b.$$

Then

$$\max_{x \in [a,b]} |f(x) - s(x)| \le \frac{5}{384}\, h^4 \max_{x \in [a,b]} |f^{(4)}(x)|,$$

where $h = \max_{0 \le i \le n-1} (x_{i+1} - x_i)$.


EXERCISES


For Exercises 1 through 3, use the values given below for the temperature, $T$, pressure, $p$,
and density, $\rho$, of the standard atmosphere as a function of altitude. This data was
drawn from Table A.6 in Frank White, Fluid Mechanics:


z (m) 0 500 1000 1500 2000 2500 3000 
T (K) 288.16 284.91 281.66 278.41 275.16 271.91 268.66 
p (Pa) 101,350 95,480 89,889 84,565 79,500 74,684 70,107 
$\rho$ (kg/m$^3$) 1.2255 1.1677 1.1120 1.0583 1.0067 0.9570 0.9092


1. Using the not-a-knot cubic spline interpolant, estimate the temperature of the
standard atmosphere at an altitude of z = 800 m, 1600 m, 2350 m, and 2790 m.
At what altitude is the temperature of the standard atmosphere 273.1 K?


2. Using the not-a-knot cubic spline interpolant, estimate the pressure of the stan- 
dard atmosphere at an altitude of z = 800 m, 1600 m, 2350 m, and 2790 m. 

3. Using the not-a-knot cubic spline interpolant, estimate the density of the standard
atmosphere at an altitude of z = 800 m, 1600 m, 2350 m, and 2790 m. At
what altitude is the density of the standard atmosphere 1.1000 kg/m$^3$?


Exercises 4 through 9 are based on the following data for the density, $\rho$, viscosity, $\mu$,
kinematic viscosity, $\nu$, surface tension, $\Upsilon$, vapor pressure, $p_v$, and sound speed, $a$, of
water as a function of temperature. This data was drawn from Tables A.1 and A.5 in
Frank White, Fluid Mechanics:


T      $\rho$       $\mu$                        $\nu$                      $\Upsilon$   $p_v$    $a$
(°C)   (kg/m$^3$)   ($\times 10^{-3}$ N·s/m$^2$)   ($\times 10^{-6}$ m$^2$/s)   (N/m)        (kPa)    (m/s)
0      1000         1.788                        1.788                      0.0756       0.611    1402
10     1000         1.307                        1.307                      0.0742       1.227    1447
20     998          1.003                        1.005                      0.0728       2.337    1482
30     996          0.799                        0.802                      0.0712       4.242    1509
40     992          0.657                        0.662                      0.0696       7.375    1529
50     988          0.548                        0.555                      0.0679       12.34    1542
60     983          0.467                        0.475                      0.0662       19.92    1551
70     978          0.405                        0.414                      0.0644       31.16    1553
80     972          0.355                        0.365                      0.0626       47.35    1554
90     965          0.316                        0.327                      0.0608       70.11    1550
100    958          0.283                        0.295                      0.0589       101.3    1543

4. Using the not-a-knot cubic spline interpolant, estimate the density of water when
T = 34°C, 68°C, 86°C, and 91°C.

5. Using the not-a-knot cubic spline interpolant, estimate the viscosity of water
when T = 34°C, 68°C, 86°C, and 91°C. At what temperature is the viscosity
$1.000 \times 10^{-3}$ N·s/m$^2$?

6. Using the not-a-knot cubic spline interpolant, estimate the kinematic viscosity
of water when T = 34°C, 68°C, 86°C, and 91°C. At what temperature is the
kinematic viscosity $1.000 \times 10^{-6}$ m$^2$/s?

7. Using the not-a-knot cubic spline interpolant, estimate the surface tension of
water when T = 34°C, 68°C, 86°C, and 91°C. At what temperature is the
surface tension 0.0650 N/m?

8. Using the not-a-knot cubic spline interpolant, estimate the vapor pressure of
water when T = 34°C, 68°C, 86°C, and 91°C.

9. Using the not-a-knot cubic spline interpolant, estimate the sound speed of water
when T = 34°C, 68°C, 86°C, and 91°C.

10. Consider the following data set:

x    0.0        0.5        1.0        1.5        2.0
y    0.500000   1.425639   2.640859   4.009155   5.305472
y'   1.500000                                    2.305472

(a) Construct the not-a-knot cubic spline for this data set.
(b) Construct the clamped cubic spline for this data set.
(c) The data for this problem is taken from the function $y = (x + 1)^2 - 0.5e^x$.
Plot the error in each of the splines from parts (a) and (b) as a function
of $x$. Which spline produced the better results?




11. Repeat Exercise 10 for the data set 


x    1.0        1.5        2.0        2.5        3.0
y    0.000000   0.608198   1.386294   2.290727   3.295837
y'   1.000000                                    2.098612


which is taken from the function $f(x) = x \ln x$.


12. Repeat Exercise 10 for the data set 


x    0.00       0.25       0.50       0.75       1.00
y    0.000000   0.176777   0.500000   0.530330   0.000000
y'   0.000000                                    −3.141593


which is taken from the function $f(x) = x \sin(\pi x)$.


13. Experimentally determined values for the partial pressure of water vapor, $p_A$,
as a function of distance, $y$, from the surface of a pan of water are given below.
The derivative of the partial pressure with respect to distance is estimated to be
−0.0455 atm/mm when y = 0 and 0 atm/mm when y = 5. Estimate the partial
pressure at distances of 0.5 mm, 2.1 mm and 3.7 mm from the surface of the
water using a clamped cubic spline.


y (mm)        0       1       2       3       4       5
$p_A$ (atm)   0.100   0.065   0.042   0.029   0.022   0.020


Natural Boundary Conditions 

Another set of boundary conditions that can be used when no other information is
available about $f$ is the natural (or free) boundary conditions $s''(a) = s''(b) = 0$.
Since $s''(a) = s_0''(a) = 2c_0$ and $s''(b) = s_n''(b) = 2c_n$, the natural boundary conditions
immediately translate to

$$c_0 = 0 \quad \text{and} \quad c_n = 0.$$

Combining these two equations with equation (5) for $j = 1, 2, 3, \ldots, n-1$ provides a
complete linear system for determining the $c_j$. The coefficient matrix for this system
is tridiagonal and strictly diagonally dominant. If $f''(a) = f''(b) = 0$, the natural
cubic spline has a fourth-order error bound (see Birkhoff and de Boor [7]); otherwise,
the natural cubic spline produces errors that are only second-order near the boundaries
(see de Boor [2]). Exercises 14-19 deal with the natural cubic spline.
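
As a starting point for the exercises that follow, the natural system can be sketched in Python as shown below. With $c_0 = c_n = 0$, only the rows of equation (5) for $j = 1, \ldots, n-1$ remain, and the terms involving $c_0$ and $c_n$ simply drop out of the first and last rows. The function names `solve_tridiagonal` and `natural_spline` are hypothetical, and the elimination routine is a generic stand-in for the tridiagonal solver of Chapter 3.

```python
def solve_tridiagonal(lower, diag, upper, rhs):
    """Solve a tridiagonal system by elimination without pivoting.

    lower[0] and upper[-1] are never used.
    """
    m = len(diag)
    diag, rhs = list(diag), list(rhs)          # work on copies
    for i in range(1, m):                      # forward elimination
        w = lower[i] / diag[i - 1]
        diag[i] -= w * upper[i - 1]
        rhs[i] -= w * rhs[i - 1]
    sol = [0.0] * m
    sol[-1] = rhs[-1] / diag[-1]
    for i in range(m - 2, -1, -1):             # back substitution
        sol[i] = (rhs[i] - upper[i] * sol[i + 1]) / diag[i]
    return sol


def natural_spline(x, y):
    """Return (a, b, c, d) for the natural cubic spline (c_0 = c_n = 0)."""
    n = len(x) - 1
    if n < 2:
        raise ValueError("this sketch assumes at least three data points")
    h = [x[j + 1] - x[j] for j in range(n)]
    # Equation (5) for j = 1, ..., n-1. The solver ignores lower[0] and
    # upper[-1], which multiply c_0 = 0 and c_n = 0 respectively.
    lower = [h[j - 1] for j in range(1, n)]
    diag = [2.0 * (h[j - 1] + h[j]) for j in range(1, n)]
    upper = [h[j] for j in range(1, n)]
    rhs = [3.0 * (y[j + 1] - y[j]) / h[j] - 3.0 * (y[j] - y[j - 1]) / h[j - 1]
           for j in range(1, n)]
    c = [0.0] + solve_tridiagonal(lower, diag, upper, rhs) + [0.0]
    # Equation (1) gives d_j and equation (4) gives b_j.
    d = [(c[j + 1] - c[j]) / (3.0 * h[j]) for j in range(n)]
    b = [(y[j + 1] - y[j]) / h[j] - (2.0 * c[j] + c[j + 1]) * h[j] / 3.0
         for j in range(n)]
    return list(y[:n]), b, c[:n], d
```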


14. Determine the natural cubic spline for the data in the example “A Clamped 
Cubic Spline.” Compare the error in the natural cubic spline to that of the 
not-a-knot cubic spline. 


15. Determine the natural cubic spline for the data in Exercise 10. Compare the 
error in the natural cubic spline to that of the not-a-knot cubic spline. 


16. Determine the natural cubic spline for the data in Exercise 11. Compare the 
error in the natural cubic spline to that of the not-a-knot cubic spline. 


17. Determine the natural cubic spline for the data in Exercise 12. Compare the
error in the natural cubic spline to that of the not-a-knot cubic spline.

18. Determine the natural cubic spline for the following data sets. In each case,
compare the natural cubic spline with the not-a-knot cubic spline.
(a) viscosity of water (Exercise 5)
(b) vapor pressure of water (Exercise 8)
(c) sound speed of water (Exercise 9)
(d) pressure of the standard atmosphere (Exercise 2)
(e) density of the standard atmosphere (Exercise 3)

19. Show that the natural cubic spline satisfies the following minimum curvature
property: Let $g$ be any function, continuous and twice continuously differentiable
on the interval $[a, b]$, which interpolates $f$ over the partition

$$a = x_0 < x_1 < x_2 < \cdots < x_{n-1} < x_n = b.$$

Then

$$\int_a^b [s''(x)]^2\,dx \le \int_a^b [g''(x)]^2\,dx,$$

where $s$ is the natural cubic spline.


5.7 HERMITE AND HERMITE CUBIC INTERPOLATION 


To this point, all interpolating data has consisted of function values only. In this 
section, derivative information will be incorporated into the interpolating polyno- 
mial. When derivative values are included in the construction of the interpolating 
polynomial, the graph of the polynomial will not just intersect the graph of the 
function being interpolated, but will touch, or “kiss,” it. Because the word osculate 
is a synonym for kiss, interpolation with derivative values is known as osculatory 
interpolation. 

Rather than develop general osculatory interpolation, which is of limited use 
in practice, we will focus on the most important special case, that of Hermite inter- 
polation. In the first half of this section, the Hermite interpolant will be defined, 
the computation of the Newton form will be described and the basic existence and 
uniqueness and error theories will be developed. Hermite cubic interpolation, which 
is another form of piecewise cubic interpolation, will then be presented in the latter 
half of the section. 


Hermite Interpolation 


Let $x_0, x_1, x_2, \ldots, x_n$ be $n + 1$ distinct points at which the function $f$ and its first
derivative are defined. In Hermite interpolation, the function value, $f(x_i)$, and the
value of the first derivative, $f'(x_i)$, are known at each interpolating point. Since
there are a total of $2n + 2$ data values, the objective is to determine a polynomial, $P$,
of degree at most $2n + 1$, that satisfies

$$P(x_i) = f(x_i) \quad \text{and} \quad P'(x_i) = f'(x_i)$$


for each $i = 0, 1, 2, \ldots, n$. The next theorem justifies referring to this function as
the Hermite interpolating polynomial.

Theorem. Let $x_0, x_1, x_2, \ldots, x_n$ be $n + 1$ distinct points on the interval
$[a, b]$ and let the function $f$ and its first derivative be defined at each of these
points. Then there exists a unique polynomial, $P$, of degree at most $2n + 1$
such that

$$P(x_i) = f(x_i) \quad \text{and} \quad P'(x_i) = f'(x_i)$$

for each $i = 0, 1, 2, \ldots, n$.
Proof. To establish existence, we will construct the Lagrange form of the
Hermite interpolating polynomial. Let $L_{n,i}$ denote the Lagrange polynomials
developed in Section 5.1, and define

$$H_i(x) = \left[1 - 2L_{n,i}'(x_i)(x - x_i)\right] L_{n,i}^2(x)$$
$$\hat{H}_i(x) = (x - x_i) L_{n,i}^2(x).$$

Note that each $H_i$ and $\hat{H}_i$ is a polynomial of degree $2n + 1$. Furthermore, it
is straightforward to show that (see Exercise 1)

$$H_i(x_j) = \begin{cases} 1, & i = j \\ 0, & \text{otherwise} \end{cases} \qquad H_i'(x_j) = 0$$

$$\hat{H}_i(x_j) = 0 \qquad \hat{H}_i'(x_j) = \begin{cases} 1, & i = j \\ 0, & \text{otherwise.} \end{cases}$$

Hence, $H_i$ is associated with the function value at $x = x_i$, and $\hat{H}_i$ is associated
with the derivative value. Now consider the polynomial

$$P(x) = \sum_{i=0}^{n} H_i(x) f(x_i) + \sum_{i=0}^{n} \hat{H}_i(x) f'(x_i).$$

Using the properties of $H_i$ and $\hat{H}_i$, it follows that

$$P(x_j) = \sum_{i=0}^{n} H_i(x_j) f(x_i) + \sum_{i=0}^{n} \hat{H}_i(x_j) f'(x_i) = f(x_j)$$

and

$$P'(x_j) = \sum_{i=0}^{n} H_i'(x_j) f(x_i) + \sum_{i=0}^{n} \hat{H}_i'(x_j) f'(x_i) = f'(x_j)$$
for each $j = 0, 1, 2, \ldots, n$. $P$ therefore interpolates all of the function and
all of the derivative values.

To establish uniqueness of the Hermite interpolating polynomial, suppose
that $Q$ is a polynomial of degree at most $2n + 1$ that interpolates the
function and derivative values of $f$ at each $x_i$ with $P \ne Q$. Let $R = P - Q$.
Then

$$R(x_i) = P(x_i) - Q(x_i) = f(x_i) - f(x_i) = 0,$$
$$R'(x_i) = P'(x_i) - Q'(x_i) = f'(x_i) - f'(x_i) = 0$$

for each $i = 0, 1, 2, \ldots, n$. This implies that each $x_i$ is a root of $R$ of multiplicity
at least 2. Therefore, $R$ is a polynomial of degree at most $2n + 1$ with
at least $2n + 2$ roots, counted with multiplicity. The Fundamental Theorem of
Algebra then guarantees that $R \equiv 0$, or $P = Q$. Hence, the Hermite interpolating
polynomial is unique. □


Though the Lagrange form of the Hermite interpolating polynomial,

$$P(x) = \sum_{i=0}^{n} H_i(x) f(x_i) + \sum_{i=0}^{n} \hat{H}_i(x) f'(x_i),$$

is useful for theoretical purposes, for practical computations, it is better to develop
the Newton form. To do this, first construct the sequence, $z_j$, of length $2n + 2$
according to the rule

$$z_j = x_{(j \;\mathrm{div}\; 2)},$$

where div denotes integer division. Next, compute a divided difference table based
on the sequence of $z$ values and the corresponding values of the function $f$. Every
other entry in the column of first divided differences will be of the form $f[x_i, x_i]$ for
some $i$. To handle these entries, recall from Section 5.3 that when $f$ is differentiable
on $[a, b]$ and $x_i, x_j \in [a, b]$ with $x_i \ne x_j$, there exists $\xi$ between $x_i$ and $x_j$ such that

$$f[x_i, x_j] = f'(\xi).$$

Letting $x_j \to x_i$, it follows from the squeeze theorem that $\xi \to x_i$. We will therefore
define $f[x_i, x_i] = f'(x_i)$. This is where the given values of the first derivative enter
into the divided difference table. The remaining entries in the table are computed
as usual. Once the table has been completed, the Newton form of the Hermite
interpolating polynomial is given by

$$P(x) = \sum_{k=0}^{2n+1} f[z_0, z_1, z_2, \ldots, z_k] \prod_{i=0}^{k-1} (x - z_i).$$

For a proof of this result, see Powell [1].
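
The construction just described translates almost line for line into code. The Python sketch below builds the doubled sequence $z_j$, fills in the divided difference table one column at a time (using the derivative values wherever the standard formula would divide by zero), and evaluates the Newton form by nested multiplication. The function names `hermite_newton_coefficients` and `newton_eval` are assumptions made for this illustration; the sample data is that of the function $f(x) = xe^{-x}$ at $x = 0$, 2, and 4.

```python
import math


def hermite_newton_coefficients(x, f, fp):
    """Return (z, coef) with coef[k] = f[z_0, z_1, ..., z_k].

    x holds the n+1 distinct interpolating points, f the function values,
    and fp the first-derivative values at those points.
    """
    m = 2 * len(x)
    z = [x[j // 2] for j in range(m)]          # z_j = x_(j div 2)
    col = [f[j // 2] for j in range(m)]        # column of function values
    coef = [col[0]]
    for k in range(1, m):
        new_col = []
        for i in range(m - k):
            if z[i + k] == z[i]:
                # Repeated point (only occurs when k = 1): f[x_i, x_i] = f'(x_i).
                new_col.append(fp[i // 2])
            else:
                new_col.append((col[i + 1] - col[i]) / (z[i + k] - z[i]))
        col = new_col
        coef.append(col[0])                    # top entry of each column
    return z, coef


def newton_eval(z, coef, t):
    """Evaluate the Newton form at t by nested multiplication."""
    result = coef[-1]
    for k in range(len(coef) - 2, -1, -1):
        result = result * (t - z[k]) + coef[k]
    return result


# Sample data: f(x) = x*exp(-x), so f'(x) = (1 - x)*exp(-x).
pts = [0.0, 2.0, 4.0]
vals = [t * math.exp(-t) for t in pts]
ders = [(1.0 - t) * math.exp(-t) for t in pts]
z_seq, newton_coef = hermite_newton_coefficients(pts, vals, ders)
```

Only the top entry of each column is retained, since those are the only divided differences that appear in the Newton form.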

EXAMPLE 5.19 Constructing the Newton Form of the Hermite 
Interpolating Polynomial 


Consider the function $f(x) = xe^{-x}$. We will construct the Newton form of the
Hermite interpolating polynomial for this function using the data in the table below.

$x_i$   $f(x_i)$    $f'(x_i)$
0       0           1
2       $2e^{-2}$   $-e^{-2}$
4       $4e^{-4}$   $-3e^{-4}$



The divided difference table for this data is shown in Figure 5.17(a). Note the rep- 
etition of the interpolating points in the first column, the corresponding repetition 
of the function values in the second column and the placement of the derivative 
values in the third column. All other entries in the table were computed using the 
standard divided difference formula. Using the values from the top of each column 
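in the divided difference table, we arrive at the Hermite interpolating polynomial

$$P(x) = x + \frac{e^{-2} - 1}{2}\, x^2 + \frac{-3e^{-2} + 1}{4}\, x^2 (x - 2)
       + \frac{e^{-4} + 4e^{-2} - 1}{16}\, x^2 (x - 2)^2
       + \frac{-9e^{-4} - 4e^{-2} + 1}{64}\, x^2 (x - 2)^2 (x - 4).$$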

A plot of this polynomial and the function $f(x) = xe^{-x}$ is shown in Figure 5.17(b).
The locations of the interpolating points are indicated by the circles. Note the way 
the Hermite interpolating polynomial matches both the height and the slope of the 
interpolated function at each circle. 


EXAMPLE 5.20 Data Analysis for the Spread of an Epidemic 


Suppose that a mathematical model for the spread of an epidemic produces the 
following estimates for the number of people who have died as a result of the epi- 
demic, D(t), and the rate at which people are dying, D’(t). Here, time is measured 
in weeks. Using this data, we wish to generate a table which shows the number of 
dead at half-week increments. 


t          D(t)          D'(t)
0.000000   0.000000      600.000000
0.750000   445.903683    573.579644
1.500000   842.695315    477.074216
2.085600   1095.211197   384.947629
2.676193   1295.955674   296.576145
3.219694   1437.602773   226.796410
3.748513   1542.363644   171.475176
4.279179   1621.280769   127.808738
4.821254   1680.890649   93.728061
5.000000   1696.803710   84.473801


The graph of the Hermite interpolating polynomial constructed from this data is 
shown in Figure 5.18. Evaluating the polynomial at t = 0, 0.5, 1.0, 1.5, 2.0, 2.5, 
3.0, 3.5, 4.0, 4.5, and 5.0 produces the desired values of D(t): 
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(a) 
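Each row lists the value f(z) followed by the divided differences of increasing order that begin at that row.

z   f(z)      First               Second                  Third                    Fourth                       Fifth
0   0         f'(0) = 1           (e^{-2} - 1)/2          (-3e^{-2} + 1)/4         (e^{-4} + 4e^{-2} - 1)/16    (-9e^{-4} - 4e^{-2} + 1)/64
0   0         e^{-2}              -e^{-2}                 (e^{-4} + e^{-2})/4      -e^{-4}/2
2   2e^{-2}   f'(2) = -e^{-2}     e^{-4}                  (-7e^{-4} + e^{-2})/4
2   2e^{-2}   2e^{-4} - e^{-2}    (-5e^{-4} + e^{-2})/2
4   4e^{-4}   f'(4) = -3e^{-4}
4   4e^{-4}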

(b) [Plot of f(x) = xe^{-x} and the interpolating polynomial P(x) for x between 0 and 4.]


Figure 5.17 Hermite interpolation for the function $f(x) = xe^{-x}$ at 
x = 0, x = 2, and x = 4. (a) Divided difference table; (b) plot of 
interpolating polynomial, P(x), and f(x). 


t_i    D(t_i)
0.0       0.000000
0.5     299.782776
1.0     586.071099
1.5     842.695315
2.0    1061.682772
2.5    1241.495466
3.0    1384.888210
3.5    1496.766805
4.0    1582.655893
4.5    1647.810593
5.0    1696.803710
[Plot for Figure 5.18; horizontal axis: t, measured in weeks.]


Figure 5.18 Graph of Hermite interpolating polynomial generated 
from the data for the spread of an epidemic. Data values are indicated 
by the asterisks. 


The last issue to address with regard to Hermite interpolation is interpolation 
error. If we replace the auxiliary function, g, in the proof of the error theorem in 
Section 5.1 by 


    $$g(t) = f(t) - P(t) - [f(x) - P(x)] \prod_{i=0}^{n} \frac{(t - x_i)^2}{(x - x_i)^2},$$


where P is interpreted as the Hermite interpolating polynomial, then we find 


Theorem. Let f be continuously differentiable 2n + 2 times on [a,b], and let 
$x_0, x_1, x_2, \ldots, x_n$ be n + 1 distinct points from [a,b]. Then for each $x \in [a,b]$, 
there exists a $\xi \in [a,b]$ such that 

    $$f(x) = P(x) + \frac{f^{(2n+2)}(\xi)}{(2n+2)!} \prod_{i=0}^{n} (x - x_i)^2,$$


where P is the Hermite interpolating polynomial. 


Hermite Cubic Interpolation 


An alternative scheme for combining cubic polynomial pieces into an interpolating 
function produces the Hermite cubic interpolant. 
Definition. The HERMITE CUBIC INTERPOLANT of f relative to the partition 

    $$a = x_0 < x_1 < x_2 < \cdots < x_{n-1} < x_n = b$$

is a function s that satisfies 

(1) on each subinterval $[x_j, x_{j+1}]$, j = 0, 1, 2, ..., n - 1, s coincides with a 
cubic polynomial $s_j(x)$; 
(2) s interpolates f and f' at $x_0, x_1, x_2, \ldots, x_n$; 
(3) s is continuous on [a, b]; 
(4) s' is continuous on [a, b]. 


Although the Hermite cubic interpolant has less smoothness than the cubic 
spline, determining the polynomial pieces for the Hermite cubic requires less work. 
Focus on the function $s_j(x)$. Interpolation of f and its first derivative at $x = x_j$ 
requires 

    $$s_j(x_j) = f(x_j) \quad\text{and}\quad s_j'(x_j) = f'(x_j).$$

Combining the continuity of s and s' at $x = x_{j+1}$ with the interpolation of f and 
its first derivative at $x = x_{j+1}$ then requires 

    $$s_j(x_{j+1}) = s_{j+1}(x_{j+1}) = f(x_{j+1})$$

    $$s_j'(x_{j+1}) = s_{j+1}'(x_{j+1}) = f'(x_{j+1}).$$

Hence, $s_j(x)$ is a polynomial of degree at most three that interpolates both f and f' at 
$x = x_j$ and at $x = x_{j+1}$. However, earlier in this section we established that the Hermite 
interpolating polynomial is the unique polynomial of degree at most three that 
interpolates f and f' at these two points. Therefore, $s_j(x)$ must be the 
Hermite interpolating polynomial. 

Having established that s;(x) is the Hermite interpolating polynomial, we can 
immediately write down that 


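    $$s_j(x) = H_{1,j}(x) f(x_j) + H_{1,j+1}(x) f(x_{j+1}) + \hat{H}_{1,j}(x) f'(x_j) + \hat{H}_{1,j+1}(x) f'(x_{j+1}), \qquad (1)$$

where 

    $$H_{1,j}(x) = \left[ 1 - 2\,\frac{x - x_j}{x_j - x_{j+1}} \right] \left( \frac{x - x_{j+1}}{x_j - x_{j+1}} \right)^2,$$

    $$H_{1,j+1}(x) = \left[ 1 - 2\,\frac{x - x_{j+1}}{x_{j+1} - x_j} \right] \left( \frac{x - x_j}{x_{j+1} - x_j} \right)^2,$$

    $$\hat{H}_{1,j}(x) = (x - x_j) \left( \frac{x - x_{j+1}}{x_j - x_{j+1}} \right)^2,$$

and 

    $$\hat{H}_{1,j+1}(x) = (x - x_{j+1}) \left( \frac{x - x_j}{x_{j+1} - x_j} \right)^2.$$
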
Note that this is just the Lagrange form of the Hermite interpolating polynomial 
over the interval $[x_j, x_{j+1}]$. We can simplify equation (1) slightly by introducing 
the translated and scaled variable $\xi = (x - x_j)/h_j$, where $h_j = x_{j+1} - x_j$, and the 
functions $\phi(\xi) = (1 + 2\xi)(1 - \xi)^2$ and $\psi(\xi) = \xi(1 - \xi)^2$. The functions $\phi$ and $\psi$ are 
usually called shape functions. 

After some minor algebraic manipulation, the details of which are left as an 
exercise, it can be shown that 


    $$H_{1,j}(x) = \phi(\xi), \quad H_{1,j+1}(x) = 1 - \phi(\xi), \quad \hat{H}_{1,j}(x) = h_j \psi(\xi), \quad\text{and}\quad \hat{H}_{1,j+1}(x) = -h_j \psi(1 - \xi).$$

Substituting these expressions into equation (1) yields 

    $$s_j(x) = f(x_{j+1}) + \phi(\xi)\,[f(x_j) - f(x_{j+1})] + h_j\,[\psi(\xi) f'(x_j) - \psi(1 - \xi) f'(x_{j+1})].$$

Note that in this form, only three function evaluations, $\phi(\xi)$, $\psi(\xi)$, and $\psi(1 - \xi)$, 
are needed to evaluate $s_j$, as opposed to the four function evaluations, $H_{1,j}(x)$, 
$H_{1,j+1}(x)$, $\hat{H}_{1,j}(x)$, and $\hat{H}_{1,j+1}(x)$, required by equation (1). 


EXAMPLE 5.21 Data Analysis for the Spread of an Epidemic—Revisited 


Above, a single Hermite interpolating polynomial was used to take the following 
data and produce a table of values for D(t) at evenly spaced values of t in increments 
of 0.5. 

t            D(t)           D'(t)
0.000000       0.000000   600.000000
0.750000     445.903683   573.579644
1.500000     842.695315   477.074216
2.085600    1095.211197   384.947629
2.676193    1295.955674   296.576145
3.219694    1437.602773   226.796410
3.748513    1542.363644   171.475176
4.279179    1621.280769   127.808738
4.821254    1680.890649    93.728061
5.000000    1696.803710    84.473801


Here, we will use a Hermite cubic interpolating polynomial to produce the desired 
table. 

For t = 0, the value of D is already known: 0.0. For t = 0.5, we need to 
evaluate the polynomial $s_0$, since 0.5 lies between the first and second values in the 
first column above. By direct calculation, we find 

    $h_0 = 0.750000$
    $\xi = 0.666667$
    $\phi(0.666667) = 0.259259$
    $\psi(0.666667) = 0.074074$
    $\psi(0.333333) = 0.148148$

and 

    $s_0(0.5) = 445.903683 + 0.259259\,(0.000000 - 445.903683)$
    $\qquad\quad + 0.750000\,(0.074074 \cdot 600.000000 - 0.148148 \cdot 573.579644)$
    $\qquad = 299.901286.$


Next, for t = 1.0, we need to evaluate $s_1$. In this case, we find 

    $h_1 = 0.750000$
    $\xi = 0.333333$
    $\phi(0.333333) = 0.740741$
    $\psi(0.333333) = 0.148148$
    $\psi(0.666667) = 0.074074$

and 

    $s_1(1.0) = 842.695315 + 0.740741\,(445.903683 - 842.695315)$
    $\qquad\quad + 0.750000\,(0.148148 \cdot 573.579644 - 0.074074 \cdot 477.074216)$
    $\qquad = 586.002536.$


Continuing in this manner, we complete the table below. Compare these values 
with those obtained previously. 


t_i D(t_i) 
0.0 0.000000 
0.5  299.901286 
1.0 586.002536 
1.5 842.695315 
2.0 1061.676605 
2.5  1241.486922 
3.0 1384.886145 
3.5 1496.767832 
4.0 1582.658182 
4.5 1647.813063 
5.0 1696.803710 
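
The calculations in this example can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `phi`, `psi`, and `hermite_cubic` are hypothetical, and the subinterval search via `bisect_right` is one convenient choice among several. The code evaluates the shape-function form of the Hermite cubic piece on whichever subinterval contains t.

```python
from bisect import bisect_right

def phi(xi):
    """Shape function phi(xi) = (1 + 2 xi)(1 - xi)^2."""
    return (1.0 + 2.0 * xi) * (1.0 - xi) ** 2

def psi(xi):
    """Shape function psi(xi) = xi (1 - xi)^2."""
    return xi * (1.0 - xi) ** 2

def hermite_cubic(t_nodes, values, slopes, t):
    """Evaluate the Hermite cubic interpolant at t (nodes must be increasing)."""
    # index j of the subinterval [t_j, t_{j+1}] containing t
    j = min(max(bisect_right(t_nodes, t) - 1, 0), len(t_nodes) - 2)
    h = t_nodes[j + 1] - t_nodes[j]
    xi = (t - t_nodes[j]) / h
    return (values[j + 1]
            + phi(xi) * (values[j] - values[j + 1])
            + h * (psi(xi) * slopes[j] - psi(1.0 - xi) * slopes[j + 1]))

# epidemic data from the example: t, D(t), D'(t)
t_nodes = [0.0, 0.75, 1.5, 2.0856, 2.676193, 3.219694,
           3.748513, 4.279179, 4.821254, 5.0]
D = [0.0, 445.903683, 842.695315, 1095.211197, 1295.955674,
     1437.602773, 1542.363644, 1621.280769, 1680.890649, 1696.803710]
dD = [600.0, 573.579644, 477.074216, 384.947629, 296.576145,
      226.796410, 171.475176, 127.808738, 93.728061, 84.473801]

table = [(0.5 * k, hermite_cubic(t_nodes, D, dD, 0.5 * k)) for k in range(11)]
for t, value in table:
    print(f"{t:3.1f}  {value:12.6f}")
```

Because each piece uses only the data at its two endpoints, changing one data value affects at most the two neighboring subintervals.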


The theorem stated below provides an error bound for Hermite cubic interpolation. 
The proof of this theorem is similar to that of the error bound for piecewise linear 
interpolation, so the details have been left as an exercise. It may seem odd that 
the error bound for the Hermite cubic interpolant is smaller than the bound for 
the clamped cubic spline (a coefficient of 1/384 versus 5/384), given that the cubic 
spline has more smoothness than the Hermite cubic. However, remember that the 
Hermite cubic was constructed using 2n + 2 data items, as compared to the n + 3 
data values used to construct the clamped cubic spline. To level the playing field, 
we should allow the clamped cubic spline the same number of data values. If we 
were to use uniformly spaced interpolating points, this would imply that the mesh 
size for the clamped cubic spline would be roughly half that for the Hermite cubic. 
We would then find that the error bound for the cubic spline would be superior. 


Theorem. Let f be continuous, with four continuous derivatives, on the 
interval [a, b], and let s be the Hermite cubic interpolant of f relative to the 
partition 

    $$a = x_0 < x_1 < x_2 < \cdots < x_{n-1} < x_n = b.$$

Then 

    $$\max_{x \in [a,b]} |f(x) - s(x)| \le \frac{1}{384}\, h^4 \max_{x \in [a,b]} |f^{(4)}(x)|,$$

where $h = \max_{0 \le i \le n-1} (x_{i+1} - x_i)$. 


References 


1. M. J. D. Powell, Approximation Theory and Methods, Cambridge University 
Press, Cambridge, 1981. 


EXERCISES 
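1. Show that the polynomials $H_i$ and $\hat{H}_i$ defined by 

    $$H_i(x) = [1 - 2L_{n,i}'(x_i)(x - x_i)]\, L_{n,i}^2(x)$$
    $$\hat{H}_i(x) = (x - x_i)\, L_{n,i}^2(x),$$

where $L_{n,i}$ is the Lagrange polynomial associated with the point $x = x_i$, satisfy 
the relations 

    $$H_i(x_j) = \begin{cases} 1, & i = j \\ 0, & \text{otherwise} \end{cases} \qquad \hat{H}_i(x_j) = 0$$
    $$H_i'(x_j) = 0 \qquad \hat{H}_i'(x_j) = \begin{cases} 1, & i = j \\ 0, & \text{otherwise.} \end{cases}$$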
2. Let f be continuously differentiable 2n + 2 times on [a,b], and let $x_0, x_1, x_2, \ldots, x_n$ 
be n + 1 distinct points from [a, b]. Provide the details of the proof that 
for each $x \in [a,b]$, there exists a $\xi \in [a, b]$ such that 

    $$f(x) = P(x) + \frac{f^{(2n+2)}(\xi)}{(2n+2)!} \prod_{i=0}^{n} (x - x_i)^2,$$

where P is the Hermite interpolating polynomial. 
3. Let f(x) =a2lnz, to = 1, and zi = 3. 
(a) Construct the Hermite interpolating polynomial for f at the specified inter- 
polating points. 
(b) Approximate f(1.5) using the polynomial from part (a), and confirm that 
the theoretical error bound holds. 
4. Let $f(x) = x \ln x$, $x_0 = 1$, $x_1 = 2$, and $x_2 = 3$. 
(a) Construct the Hermite interpolating polynomial for f at the specified inter- 
polating points. 


(b) Approximate f(1.5) using the polynomial from part (a), and confirm that 
the theoretical error bound holds. 


(c) Construct the Hermite cubic interpolant for f at the specified interpolating 
points. 

(d) Approximate f(1.5) using the piecewise polynomial from part (c), and con- 
firm that the theoretical error bound holds. 


5. Let $f(x) = xe^{-x}$, $x_0 = 1$, $x_1 = 2$, and $x_2 = 3$. 
(a) Construct the Hermite interpolating polynomial for f at the specified inter- 
polating points. 


(b) Approximate f(1.5) using the polynomial from part (a), and confirm that 
the theoretical error bound holds. 


(c) Construct the Hermite cubic interpolant for f at the specified interpolating 
points. 

(d) Approximate f(1.5) using the piecewise polynomial from part (c), and con- 
firm that the theoretical error bound holds. 


6. Let f(a) = Tapa? vo = —1, x1 = 0, and ze = 1. 


(a) Construct the Hermite interpolating polynomial for f at the specified inter- 
polating points. 


(b) Approximate f(—0.3) using the polynomial from part (a), and confirm that 
the theoretical error bound holds. 


(c) Construct the Hermite cubic interpolant for f at the specified interpolating 
points. 

(d) Approximate f(—0.3) using the piecewise polynomial from part (c), and 
confirm that the theoretical error bound holds. 


7. A model for the growth of an insect population predicts the following values 
for the population, P(t), and the rate of increase in the population, P'(t), as 
functions of time. Here, time is measured in months. 

t           P(t)         P'(t)
0.000000     5.000000    1.850962
0.500000     6.008286    2.179438
0.950023     7.050280    2.443439
1.447286     8.323016    2.658770
1.947286     9.682456    2.756773
2.447286    11.056543    2.716253
2.947286    12.376723    2.544655
3.447286    13.584544    2.273554
3.947286    14.641031    1.946924
4.430434    15.502227    1.618850
4.848017    16.121126    1.348776
5.000000    16.319048    1.256352
(a) Use the Hermite interpolating polynomial derived from this data to tabulate 
the population in half-week increments. 
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(a) Data for Exercises 8, 9, 10 (b) Data for Exercises 11, 12, 13 


Time    Height      Velocity          Time    Charge        Current
(sec)   (meters)    (meters/sec)      (sec)   (coulombs)    (amperes)
0.00    0.290864    -0.16405          0.00    0.000000       0.000000
0.02    0.284279    -0.32857          0.02    0.003293       0.249906
0.04    0.274400    -0.49403          0.04    0.007381       0.121402
0.06    0.260131    -0.71322          0.06    0.007887      -0.053314
0.08    0.241472    -0.93309          0.08    0.006296      -0.080449
0.10    0.219520    -1.09409          0.10    0.005296      -0.015126
0.12    0.189885    -1.47655          0.12    0.005525       0.028800
0.14    0.160250    -1.47891          0.14    0.006086       0.020787
0.16    0.126224    -1.69994          0.16    0.006255      -0.002842
0.18    0.086711    -1.96997          0.18    0.006085      -0.010721
0.20    0.045002    -2.07747          0.20    0.005927      -0.003931
0.22    0.000000    -2.25010


TABLE 5.2: (a) Data for Exercises 8, 9, 10. (b) Data for Exercises 11, 12, 13. 


(b) Use the Hermite cubic interpolating polynomial derived from this data to 
tabulate the population in half-week increments. 


(c) Use the clamped cubic spline derived from this data to tabulate the popu- 
lation in half-week increments. 
(d) Use the not-a-knot cubic spline derived from this data to tabulate the pop- 
ulation in half-week increments. 
(e) Compare the results from (a), (b), (c), and (d). 
8. Table 5.2(a) gives the height and velocity of a free-falling object. 
(a) Construct the Hermite cubic interpolant for this data set. 
(b) What is the height of the object when t = 0.05 seconds? when t = 0.15 
seconds? 
(c) At what time is the object 0.20 meters above the ground? 0.10 meters above 
the ground? 
9. Repeat Exercise 8 using the Hermite interpolating polynomial. 
10. Repeat Exercise 8 using the clamped cubic spline. 
11. Table 5.2(b) gives the charge on the capacitor and the current flowing through 
an RLC circuit. Recall that current is the rate of change of charge. 
(a) Construct the Hermite cubic interpolant for this data set. 
(b) What is the charge on the capacitor when t = 0.05 seconds? when t = 0.15 
seconds? 
(c) At what time is the charge on the capacitor a maximum? 


12. Repeat Exercise 11 using the Hermite interpolating polynomial. 


13. Repeat Exercise 11 using the clamped cubic spline. 
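14. Let $\xi = (x - x_j)/h_j$, where $h_j = x_{j+1} - x_j$. Show that 

    $$H_{1,j}(x) = \phi(\xi), \quad H_{1,j+1}(x) = 1 - \phi(\xi), \quad \hat{H}_{1,j}(x) = h_j \psi(\xi), \quad\text{and}\quad \hat{H}_{1,j+1}(x) = -h_j \psi(1 - \xi),$$

where 

    $$H_{1,j}(x) = \left[ 1 - 2\,\frac{x - x_j}{x_j - x_{j+1}} \right] \left( \frac{x - x_{j+1}}{x_j - x_{j+1}} \right)^2,$$

    $$H_{1,j+1}(x) = \left[ 1 - 2\,\frac{x - x_{j+1}}{x_{j+1} - x_j} \right] \left( \frac{x - x_j}{x_{j+1} - x_j} \right)^2,$$

    $$\hat{H}_{1,j}(x) = (x - x_j) \left( \frac{x - x_{j+1}}{x_j - x_{j+1}} \right)^2,$$

    $$\hat{H}_{1,j+1}(x) = (x - x_{j+1}) \left( \frac{x - x_j}{x_{j+1} - x_j} \right)^2,$$

    $$\phi(\xi) = (1 + 2\xi)(1 - \xi)^2,$$

and 

    $$\psi(\xi) = \xi(1 - \xi)^2.$$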


15. Prove the theorem that provides the error bound for the Hermite cubic inter- 
polant. (Use the proof of the error bound for piecewise linear interpolation in 
Section 5.5 as a model.) 


16. (a) Suppose that f is twice differentiable. Show that 

    $$f[x_i, x_i, x_i] = \frac{f''(x_i)}{2}.$$

(b) Suppose that f is n times differentiable. Show that 

    $$f[\underbrace{x_i, x_i, \ldots, x_i}_{n+1 \text{ times}}] = \frac{f^{(n)}(x_i)}{n!}.$$
17. Let f be a function defined on the interval [a,b], and let $x_0, x_1, x_2, \ldots, x_n$ 
be n + 1 distinct points from [a,b]. For each i = 0, 1, 2, ..., n, let $m_i$ be a 
nonnegative integer. The polynomial, P, of degree at most $d = n + \sum_{i=0}^{n} m_i$, 
such that 

    $$P^{(k)}(x_i) = f^{(k)}(x_i)$$

for each i = 0, 1, 2, ..., n and each k = 0, 1, 2, ..., $m_i$, is called the osculatory 
interpolating polynomial. With the Newton form of the Hermite interpolating 
polynomial as a guide and using the results of Exercise 16, construct the Newton 
form of the osculatory interpolating polynomial. 


18. Determine the osculatory interpolating polynomial for each of the following func- 
tions using the indicated amount of data at the specified points. 
(a) $f(x) = x \ln x$, $x_0 = 1$, $x_1 = 2$, $x_2 = 3$, $m_0 = 1$, $m_1 = 0$, $m_2 = 2$ 
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(b) f(x) = ar zo = -1, 2, = —-1/2, co = 0, 23 = 1/2, rg = 1, mo = 1, 
Mm, =m =m3=0,m4 = 1 

(c) $f(x) = e^{-x}$, $x_0 = 0$, $x_1 = 1$, $x_2 = 2$, $m_0 = 0$, $m_1 = 1$, $m_2 = 2$ 

19. Let f be a function defined on the interval [a, b], and let $x_0, x_1, x_2, \ldots, x_n$ 
be n + 1 distinct points from [a,b]. For each i = 0, 1, 2, ..., n, let $m_i$ be a 
nonnegative integer. 

(a) Prove that the osculatory interpolating polynomial is unique. 

(b) If we suppose that f is sufficiently differentiable, what is the error associated 
with the osculatory interpolating polynomial? Prove it. 


5.8 REGRESSION 


Regression is a powerful technique for predicting the value of a dependent variable 
and for estimating the values of model parameters. This method finds application 
in business, economics, the physical and biological sciences, engineering, and more. 
Whereas interpolation is fundamentally a local procedure which forces the error to 
be zero at specific, isolated locations, regression is a global process. In regression 
analysis, the function that is being fit to the data is overdetermined, meaning that 
the number of coefficients is smaller than the number of data points. Values for these 
coefficients are computed by requiring some measure of the total approximation 
error be minimized. In this section we will deal exclusively with discrete data. 


Linear Regression 


A manufacturing firm wishes to estimate the production costs for one of its product 
lines. Over a four week period they monitor the daily production runs, tabulating 
run size and corresponding total cost. The results of this study were as follows: 


Run Size    Total Cost ($)      Run Size    Total Cost ($)
1550        17,224              2175        24,095
 852        11,314              1213        13,474
2120        22,186              3050        29,349
1128        15,982              1215        14,459
1518        16,497              2207        23,483
 786        10,536              1234        14,444
1505        15,888              1616        18,949
1264        13,055              3089        31,237
1963        22,215              2033        21,384
1414        17,510              1467        18,012


A scatter plot of the data (Figure 5.19) shows a clear, roughly linear trend. 
For simplicity, we will assume that the relationship between run size and total cost 
is exactly linear. 

Figure 5.19 Scatter plot of total production cost versus run size data. 
Least squares regression line is shown. 

There are many lines that would approximate the data reasonably well. Let 
the line that fits the data “best” be given by $\hat{y} = a + bx$, where the parameters a 
and b are to be determined. The criterion by which the fit of the line to the data is 
judged will be given momentarily. Further, let $(x_i, y_i)$ for i = 1, 2, 3, ..., n denote 
the data pairs being examined, and let $e_i$ measure the deviation of the best fit line 
from the data; that is, for each i 

    $$e_i = y_i - \hat{y}_i = y_i - (a + b x_i).$$

Define the total error, E, by 

    $$E = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} [y_i - (a + b x_i)]^2.$$

Our objective will be to choose a and b so as to minimize E. This goal is known as 
the least-squares criterion and produces the least-squares regression line. 
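To minimize E, we must have 

    $$\frac{\partial E}{\partial a} = \frac{\partial E}{\partial b} = 0.$$

Since 

    $$\frac{\partial E}{\partial a} = -2 \sum_{i=1}^{n} [y_i - (a + b x_i)]$$

and 

    $$\frac{\partial E}{\partial b} = -2 \sum_{i=1}^{n} [y_i - (a + b x_i)]\, x_i,$$

the system of two equations 

    $$na + b \sum_{i=1}^{n} x_i = \sum_{i=1}^{n} y_i$$

    $$a \sum_{i=1}^{n} x_i + b \sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} x_i y_i$$

is obtained. The solution of this system is 

    $$b = \frac{n \sum_{i=1}^{n} x_i y_i - \left( \sum_{i=1}^{n} x_i \right) \left( \sum_{i=1}^{n} y_i \right)}{n \sum_{i=1}^{n} x_i^2 - \left( \sum_{i=1}^{n} x_i \right)^2}$$

    $$a = \bar{y} - b\bar{x},$$

where $\bar{x} = \left( \sum_{i=1}^{n} x_i \right)/n$ and $\bar{y} = \left( \sum_{i=1}^{n} y_i \right)/n$ are the means of the x and the y 
values, respectively. 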
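
The closed-form solution above translates directly into code. The following Python sketch is illustrative only; the function names `regression_from_sums` and `least_squares_line` are hypothetical, and the small data set at the end is made up purely to exercise the formulas.

```python
def regression_from_sums(n, sum_x, sum_y, sum_xy, sum_xx):
    """Solve the two normal equations for intercept a and slope b,
    given the five sums that appear in them."""
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
    a = sum_y / n - b * (sum_x / n)      # a = ybar - b * xbar
    return a, b

def least_squares_line(xs, ys):
    """Least-squares regression line y = a + b x for paired data."""
    n = len(xs)
    return regression_from_sums(
        n,
        sum(xs),
        sum(ys),
        sum(x * y for x, y in zip(xs, ys)),
        sum(x * x for x in xs),
    )

# made-up illustration: points lying exactly on y = 1 + 2x
a, b = least_squares_line([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(a, b)
```

Working from the sums rather than the raw data mirrors the hand calculations in the text; for data with very large x values, a formulation centered on the means is less prone to cancellation.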
Returning to the data from the manufacturing firm problem, we find 


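    $$n = 20, \quad \sum_{i=1}^{n} x_i = 33{,}399, \quad \sum_{i=1}^{n} y_i = 371{,}293, \quad \sum_{i=1}^{n} x_i y_i = 685{,}996{,}183,$$

and 

    $$\sum_{i=1}^{n} x_i^2 = 63{,}345{,}673.$$

Substituting these values into the equations for a and b gives 

    $$b = \frac{(20)(685{,}996{,}183) - (33{,}399)(371{,}293)}{(20)(63{,}345{,}673) - (33{,}399)^2} = \frac{1{,}319{,}108{,}753}{151{,}420{,}259} \approx \$8.71/\text{unit}$$

    $$a = \frac{371{,}293}{20} - \frac{33{,}399}{20} \cdot \frac{1{,}319{,}108{,}753}{151{,}420{,}259} \approx \$4016.76.$$
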

Therefore, the regression line for the total production cost as a function of run size 
is $\hat{y} = 4016.76 + 8.71x$. This line is graphed in Figure 5.19 for comparison. If the 
standard run size for this product line is 2200 units, we would then predict that the 
total production cost would be $4016.76 + (2200 units)($8.71/unit), or $23,178.76. 
As in most practical regression problems, the parameters a and b have significance 
beyond being the y-intercept and slope of the regression line, respectively. Here 
a represents the fixed costs, or overhead, associated with this particular product, 
whereas b represents the variable costs of production. 


Worked Examples 


EXAMPLE 5.22 A New Exit on the Jersey Turnpike 


The following table gives information about the tolls charged at the main toll booth 
at the southern end of the New Jersey Turnpike. The distance values are the number 
of miles from the indicated exit to the main toll booth, and the “toll” values are the 
toll charge assessed for a vehicle which enters the turnpike at the indicated exit. 

Exit    Distance    Toll        Exit    Distance    Toll
2       12          0.45        8       67          1.70
3       25          0.70        8A      73          1.85
4       33          0.95        9       82          2.15
5       43          1.20        10      87          2.20
6       50          1.85        11      90          2.40
7       52          1.45        12      95          2.65
7A      59          1.55

After examining traffic patterns, the Turnpike Authority has decided to con- 
struct a new exit, Exit 5A, a distance of 48 miles from the main toll booth. Based 
on the information contained in the table, what toll should be charged for a vehicle 
that enters the turnpike at Exit 5A? 

A scatter plot of the toll charge as a function of distance would suggest a 
roughly linear relationship. We will therefore fit the data to a linear function. 
Letting x denote the distance and y denote the toll charge, we find 

    $$n = 13, \quad \sum_{i=1}^{n} x_i = 768, \quad \sum_{i=1}^{n} y_i = 21.1, \quad \sum_{i=1}^{n} x_i y_i = 1449.6 \quad\text{and}\quad \sum_{i=1}^{n} x_i^2 = 53628.$$

Using these values, it follows that 

    $$b = \frac{(13)(1449.6) - (768)(21.1)}{(13)(53628) - (768)^2} \approx \$0.0246/\text{mile}$$

    $$a = \frac{21.1}{13} - \frac{768}{13} \cdot 0.0246 \approx \$0.17.$$

In other words, the toll structure for the listed exits consists of a fixed charge 
of roughly 17 cents for simply using the turnpike and a variable charge of roughly 
2.5 cents per mile. Based on this toll structure, the toll for the new exit 48 miles 
from the main toll booth should be 

    $$\hat{y} = 0.17 + (0.0246)(48) = \$1.35.$$


EXAMPLE 5.23 Calibration of a Thermocouple 


A group of physics students has constructed a thermocouple for use in a laboratory 
experiment. To calibrate the device, they have collected the following data. 
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Temperature (°C) Reading (mV) 


0 0.01 
20 0.12 
40 0.24 
60 0.38 
80 0.51 
100 0.67 
120 0.84 
140 1.01 
160 1.15 
180 1.31 


The end result of the calibration process is supposed to be a formula which 
translates a thermocouple reading, in millivolts, into a temperature reading, in °C. We 
will therefore let x denote the thermocouple reading, and y denote the 
corresponding temperature. This assignment leads to the values 

    $$n = 10, \quad \sum_{i=1}^{n} x_i = 6.24, \quad \sum_{i=1}^{n} y_i = 900, \quad \sum_{i=1}^{n} x_i y_i = 804.6 \quad\text{and}\quad \sum_{i=1}^{n} x_i^2 = 5.6898,$$

which yield the calibration parameters a ≈ 5.574°C and b ≈ 135.298°C/mV. The 
temperature corresponding to a given thermocouple reading, x, is then determined 
by the equation 

    $$T = 5.574 + 135.298x.$$

Transformations to Linear 


In many circumstances, a linear model is inappropriate. For example, the error 
sequences generated by fixed point iteration schemes in Chapter 2 satisfied power 
laws of the form $|e_{n+1}| = C|e_n|^{\alpha}$, where $\alpha$ represented the order of convergence. 
Growth and decay phenomena often obey exponential laws of the form $y = ab^x$. 
Still other phenomena are modeled by logarithmic laws, $y = a + b \log x$; reciprocal 
laws, $y = 1/(a + bx)$; and higher-degree polynomial laws. Here, we focus on power 
laws and exponential laws, which can be transformed to linear problems, and 
logarithmic laws, which are already linear in the model parameters. Other cases will 
be considered in the exercises. 

Let's start with the power law, $y = ax^b$. Taking the logarithm of both sides 
of the power law (the base of the logarithm is irrelevant) yields 

    $$\log y = \log(ax^b) = \log a + b \log x.$$

Hence, log x and log y are related linearly when x and y are related by a power law. 
Therefore, to fit data to an equation of the form $y = ax^b$, first take the logarithm of 
the x-values and the y-values, then perform linear regression on the resulting data 
set. The y-intercept of the regression line is the logarithm of the coefficient in the 
power law, and the slope of the regression line gives the exponent. 

Exponential laws, $y = ab^x$, can be handled in a similar manner. Take the 
logarithm of both sides of the exponential law to yield 

    $$\log y = \log(ab^x) = \log a + x \log b.$$

Therefore, to fit data to an exponential law, take the logarithm of the y values, then 
perform linear regression. The y-intercept of the regression line is the logarithm of 
the coefficient in the exponential law, while the slope gives the logarithm of the base 
of the exponential. In many instances, the exponential law is written as $y = ae^{bx}$. 
In these cases, the logarithm of the law becomes 

    $$\ln y = \ln a + bx,$$

so that the slope of the regression line is the coefficient in the exponent. 
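
Both transformations can be sketched with a single helper. The Python fragment below is a minimal illustration under stated assumptions: the function names `fit_line`, `fit_power_law`, and `fit_exponential_law` are hypothetical, natural logarithms are used throughout, and all data values must be positive wherever a logarithm is taken.

```python
import math

def fit_line(xs, ys):
    """Least-squares line y = a + b x; returns (a, b)."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    b = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
    return sy / n - b * sx / n, b

def fit_power_law(xs, ys):
    """Fit y = a x^b by regressing ln y on ln x; returns (a, b)."""
    ln_a, b = fit_line([math.log(x) for x in xs], [math.log(y) for y in ys])
    return math.exp(ln_a), b

def fit_exponential_law(xs, ys):
    """Fit y = a b^x by regressing ln y on x; returns (a, b)."""
    ln_a, ln_b = fit_line(xs, [math.log(y) for y in ys])
    return math.exp(ln_a), math.exp(ln_b)

# synthetic checks: data generated exactly from each law
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
power_fit = fit_power_law(xs, [2.0 * x ** 3 for x in xs])
expo_fit = fit_exponential_law(xs, [3.0 * 1.5 ** x for x in xs])
print(power_fit, expo_fit)
```

Note that the fit minimizes the squared error in the logarithms, not in the original y values, so the result generally differs slightly from a direct nonlinear least-squares fit when the data are noisy.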


Worked Examples 


EXAMPLE 5.24 Order of Convergence 


In general, a numerical method for approximating the value of a definite integral 
is said to have order of convergence $\alpha$ if the absolute approximation error, E, is 
related to the mesh spacing parameter, h, by the power law $E = Ch^{\alpha}$. Simpson's 
rule is one general technique for approximating the value of a definite integral. The 
method involves dividing the integration interval into an even number of equal-sized 
subintervals and computing a weighted sum of the values of the integrand at the 
endpoints of the subintervals. The mesh spacing parameter for Simpson's rule is 
the size of the subintervals into which the integration interval is partitioned. 

To determine experimentally the order of convergence of Simpson's rule, we 
approximate the value of the definite integral 

    $$\int_0^1 x e^x \, dx$$

for several different values of h and compute the absolute error in the approximation. 
The results of these experiments are 

h        Absolute Error, E
1/2      2.620728 x 10^{-3}
1/4      1.690471 x 10^{-4}
1/8      1.065014 x 10^{-5}
1/16     6.669677 x 10^{-7}
1/32     4.170636 x 10^{-8}
1/64     2.606974 x 10^{-9}
1/128    1.629410 x 10^{-10}

Next, we fit the data to the power law $E = Ch^{\alpha}$. Since we are working with a power 
law, we will perform a linear fit of ln E versus ln h. Note that we have chosen to 
use the natural logarithm, but any other logarithm would also work. The resulting 
regression line is 

    $$\ln E = -3.159 + 3.992 \ln h.$$

Exponentiating this equation gives $E = 0.0425h^{3.992}$. Therefore, for this problem, 
it appears that Simpson's rule has order of convergence roughly 4. 
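
The experiment in this example can be reproduced with a short script. The sketch below is illustrative: the function name `composite_simpson` is hypothetical, and it assumes the integrand x e^x on [0, 1], whose exact integral is 1. It computes the Simpson's rule error for h = 1/2, 1/4, ..., 1/128 and then fits ln E against ln h by least squares.

```python
import math

def composite_simpson(f, a, b, n):
    """Composite Simpson's rule on [a, b] with n subintervals (n even)."""
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3.0

integrand = lambda x: x * math.exp(x)
exact = 1.0                                   # integral of x e^x over [0, 1]

ns = [2 ** k for k in range(1, 8)]            # 2, 4, ..., 128 subintervals
hs = [1.0 / n for n in ns]
errors = [abs(composite_simpson(integrand, 0.0, 1.0, n) - exact) for n in ns]

# least-squares fit of ln E = intercept + slope * ln h
X = [math.log(h) for h in hs]
Y = [math.log(e) for e in errors]
m = len(X)
sx, sy = sum(X), sum(Y)
sxy = sum(x * y for x, y in zip(X, Y))
sxx = sum(x * x for x in X)
slope = (m * sxy - sx * sy) / (m * sxx - sx ** 2)
intercept = sy / m - slope * sx / m

print(f"ln E = {intercept:.3f} + {slope:.3f} ln h")
```

The fitted slope is the experimental order of convergence, and the exponential of the intercept estimates the constant C.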


EXAMPLE 5.25 CD Sales versus LP Sales: An Exponential Model 


The table below summarizes the sales of compact discs (CDs) and long playing 
records (LPs) over an eleven year period. Sales are listed in millions of units. 


CDs      0    0.8    5.8     23     53    102    150    207    287    333    408
LPs    244    210    205    167    125    107     72     35     12    4.8    2.3


Based on a scatter plot of the data (shown in the upper left panel of Figure 5.20), an 
economist hypothesizes that the data follows either an exponential law or a power 
law. To decide between the two, the economist plots the logarithm of the LP sales 
versus the CD sales (upper right panel in Figure 5.20) and the logarithm of the LP 
sales versus the logarithm of the CD sales (lower left panel). Since the former plot 
is more nearly linear, the economist settles on the exponential law. 

The resulting regression line is found to be 

    $$\log L = 2.401 - 0.00479C,$$

where L denotes the level of LP sales, C denotes the level of CD sales, and 
common (base 10) logarithms were used. Exponentiating both sides of the regression 
equation yields $L = 251.768 \cdot (0.989)^C$. The graph of this equation is shown 
superimposed on the data in the lower right panel of Figure 5.20 to demonstrate the accuracy 
of the fit. 




EXAMPLE 5.26 Break-Even Point for Vitamin A Dosage 


To estimate the amount of vitamin A required for maintaining weight, laboratory 
rats were fed a basic diet devoid of vitamin A, but were given controlled supple- 
mentary rations of vitamin A in the form of cod liver oil. The following table 
summarizes the supplementary dosage of vitamin A and the corresponding weight 
gain for the test subjects. 


Dosage (mg)             0.25    1.00    1.50    2.50    7.50
Weight Gain (grams)    -10.8    13.5    16.4    28.7    51.3
Figure 5.20 (Upper left) Scatter plot of data for CD sales versus LP 
sales. (Upper right) Logarithm of LP sales versus CD sales. (Lower 
left) Logarithm of LP sales versus logarithm of CD sales. (Lower 
right) Exponential law fit to the data points. 


A graph of the data (Figure 5.21) suggests fitting the data to a logarithmic law of 
the form $W = a + b \log D$. Using base 10 logarithms, we find 

    $$W = 12.762 + 41.661 \log D.$$


Setting W = 0, it follows that D ≈ 0.49 mg of vitamin A is required to maintain 
weight. 
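
Because the logarithmic law is already linear in a and b, the fit only requires replacing the dosage by its logarithm before applying the regression formulas. The following Python sketch is an illustration, with variable names chosen here for convenience; it uses the dosage and weight gain data from this example and base 10 logarithms.

```python
import math

dosage = [0.25, 1.00, 1.50, 2.50, 7.50]          # mg
weight_gain = [-10.8, 13.5, 16.4, 28.7, 51.3]    # grams

# regress W on x = log10(D)
xs = [math.log10(d) for d in dosage]
n = len(xs)
sx, sy = sum(xs), sum(weight_gain)
sxy = sum(x * w for x, w in zip(xs, weight_gain))
sxx = sum(x * x for x in xs)
b = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
a = sy / n - b * sx / n

# break-even dosage: solve 0 = a + b log10(D)
break_even = 10.0 ** (-a / b)

print(f"W = {a:.3f} + {b:.3f} log D, break-even at D = {break_even:.2f} mg")
```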


EXERCISES 


1. One of the following data sets follows an exponential law and the other follows 
a power law. Which is which? 
x      2.0     2.5     3.0     3.5     4.0      4.5      5.0
y1   14.79   27.75   47.09   74.07  109.99   156.10   213.69


x      2.0     2.5     3.0     3.5     4.0      4.5      5.0
y2   12.13   19.58   31.59   50.97   82.21   132.59   213.82
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Figure 5.21 Scatter plot of data for weight gain of laboratory rats 
versus supplementary dosage of vitamin A. Solid curve is the logarithmic 
law fit to the data points. 


2. One of the following data sets follows an exponential law and the other follows 
a power law. Which is which? 
x      2.0     2.5     3.0     3.5     4.0     4.5     5.0
y1   1.216   1.087   0.972   0.870   0.778   0.696   0.622


x      2.0     2.5     3.0     3.5     4.0     4.5     5.0
y2   1.108   0.758   0.556   0.427   0.341   0.279   0.233
3. One of the following data sets follows a logarithmic law and the other follows a 
power law. Which is which? 
x      2.0     2.5     3.0     3.5     4.0     4.5     5.0
y1   16.50   17.77   18.89   19.88   20.79   21.62   22.40


x      2.0     2.5     3.0     3.5     4.0     4.5     5.0
y2   11.73   14.54   16.84   18.78   20.46   21.95   23.27
4. Experimental data relating the oxide thickness, measured in Angstroms, of a thin 
film to the baking time of the film, measured in minutes, is given in the table 
below. 
Baking time        20    30    40    60    70    90   100   120   150   180
Oxide thickness   3.5   7.4   7.1  15.6  11.1  14.9  23.5  27.1  22.1  32.9
(a) Construct a scatter plot of this data. What functional form is most appro- 
priate for fitting this data? 

(b) Fit the data to the function indicated in part (a). What physical significance 
do the model parameters have? 

(c) Predict the oxide thickness for a film which is baked for 45 minutes. 
5. The total production cost as a function of the number of machine hours is 
provided for a sample of nine production runs. Estimate the fixed costs and the 
variable costs associated with this process. 

Machine hours            22   23   19   12   12    9    7   11   14
Total cost (in 1000s)    23   25   20   20   20   15   14   14   16


6. The resistivity of platinum as a function of temperature is given below. Estimate 
the parameters in a linear fit to the data and predict the resistivity when the 
temperature is 365 K. 
Temperature (K) 100 200 300 400 500 
Resistivity (Ω-cm, x10°) 41 80 126 163 194 


7. The table below shows the time (in seconds) required for water to drain through 
a hole in the bottom of a bottle as a function of the depth (in inches) to which 
the bottle has been filled. 

Depth 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 
Time 65.99 120.28 166.69 207.85 245.41 279.95 313.04 344.24 


(a) Construct a scatter plot of this data. What functional form is most appro- 
priate for fitting this data? 


(b) Fit the data to the function indicated in part (a). 


8. The weight, W, of a metallic object decreases over time when exposed to a 
caustic environment according to the exponential law $W = ae^{-t/\tau}$, where t is 
the exposure time and $\tau$ is known as the decay rate constant. Data for a group 
of objects made from the same material is given in the following table. 
Exposure time (days)     5     10     15     20     25     30     35     40
Weight (grams)        92.7   58.3   59.5   41.7   45.6   31.8   38.3   19.9

Estimate the decay rate constant, $\tau$, for this material. 
9. Barometric pressure, P, as a function of elevation above sea level, h, is modeled 
by the relation $P = ae^{bh}$. Use the data in the table below to estimate the 
model parameters and to predict the barometric pressure at an elevation of 1200 
feet. 


Barometric pressure (mm Hg)        29.9   29.4   29.0   28.4   27.7
Elevation above sea level (feet)      0    500   1000   1500   2000


10. When an ideal gas undergoes an isentropic process, the pressure and volume 
are related by $P = cV^{-\gamma}$, where $\gamma$ is the ratio of the specific heats of the gas. 
Estimate the value of $\gamma$ based on the values in the following table: 

Pressure (psi)    16.8   39.7   78.6   115.5   195.0   546.1
Volume (in^3)       50     30     20      15      10       5


11. The results of a tensile strength test for a circular cold-rolled steel specimen are 
provided in the table below. The specimen had an original diameter of 0.507 inches 
and an original length of 2 inches. The normal stress, $\sigma$, and the normal strain, $\epsilon$, 
are given by the equations 

    $$\sigma = \frac{P}{a} \quad\text{and}\quad \epsilon = \frac{\Delta}{L},$$

where P denotes the load, $\Delta$ the elongation, a the original cross-sectional area, 
and L the original length of the specimen. From the test data, we want to 
estimate the modulus of elasticity, E, which is defined as the ratio $\sigma/\epsilon$ in the 
linear portion of the stress-strain curve. 


Load         Elongation       Load         Elongation
(10^3 lb)    (10^{-4} in)     (10^3 lb)    (10^{-4} in)
0            0                4.85         16
1.25         4                5.45         18
1.85         6                6.05         20
2.4          8                6.7          22
3.05         10               7.25         24
3.64         12               6.9          40
4.25         14               6.95         80


Note: For this problem, you will first need to decide which of the data points 
correspond to the linear portion of the stress-strain curve. 
12. The following table gives the ion concentration, $n$, as a function of time, $t$, after
an ionization agent has been turned off.

Time (sec)    0     1     2     3     4     5     6     7     8     9     10
n (×10^-4)    5.03  4.71  4.40  3.97  3.88  3.62  3.30  3.15  3.08  2.92  2.70

Theory indicates that ion concentration and time satisfy the reciprocal relationship

$$n = \frac{n_0}{1 + n_0 \alpha t},$$

where $n_0$ is the initial concentration of ions and $\alpha$ is the coefficient of recombination.
(a) Take the reciprocal of the above equation relating ion concentration and
time, and show that $n^{-1}$ and $t$ are related in a linear fashion.
(b) Perform linear regression on $n^{-1}$ versus $t$ to estimate the initial concentration
of ions and the coefficient of recombination.
13. Consider the following data relating the amount of varnish additive and the
resulting varnish drying time.

Additive (grams)      0.0   1.0   2.0   3.0  4.0  5.0  6.0  7.0  8.0
Drying time (hours)   12.0  10.5  10.0  8.0  7.0  8.0  7.5  8.5  9.0

(a) Produce a scatter plot of the data and show that the data roughly follows
the pattern of a quadratic function.

(b) Apply the least squares criterion to the regression equation $y = a + bx + cx^2$
to determine formulas for $a$, $b$, and $c$.


(c) Use the results of part (b) to determine the regression parabola for the data 
given above. What amount of varnish additive will produce the minimum 
drying time? 


CHAPTER 6 


Differentiation and Integration 


AN OVERVIEW 
Fundamental Mathematical Problems 


In this chapter we will discuss the concepts and techniques associated with numeri- 
cal differentiation and integration. In particular, we will address the following three 
problems. 


Problem 1 
Approximate the value of a derivative of a function defined by discrete 
data. 


Problem 2 
Derive a formula that approximates the derivative of a function in terms 
of a linear combination of function values. 


Problem 3 
Approximate the value of the definite integral of a continuous function, 
defined by a formula or by discrete data, over a specified interval. 


The “Estimating a Coefficient of Friction” problem capsule from the Chapter 1 
Overview (see page 5) is one example of an application that requires numerical 
differentiation. Here are two applications that require numerical integration. 


Bags of Pine Bark Mulch 


Kindly Doc B has an irregularly shaped region in his back yard (see Figure 6.1).
He has tried to grow every conceivable plant in that spot, but nothing but weeds
ever seems to grow; he has therefore decided to cover the entire plot with pine bark
mulch. If Doc B would like to lay down a uniform 3-inch covering of bark mulch
over the entire plot and the home improvement store sells bags containing 3 cubic
feet of bark mulch, how many bags does Doc B need to buy?

Clearly, the number of bags needed can be obtained by dividing the total volume
of mulch required to cover the plot by the capacity of each bag (3 cubic feet of mulch).
In turn, because the plot is to be covered with a layer of mulch 3 inches (one-quarter
foot) thick, the volume of mulch needed for the job, in cubic feet, is just one-quarter
the area of the plot in square feet. Consequently,

$$\text{number of bags needed} = \frac{1}{12} \times \text{area of the plot}.$$
Figure 6.1 Plot of irregularly shaped region for the “Bags of Pine Bark Mulch” problem. The plot is 26 feet long, and width measurements are recorded every foot along the length of the plot. Width labels shown in the figure (feet): 8.0, 9.0, 6.0, 10.0, 11.0, 15.0, 16.0, 16.0, 16.0, 16.0, 15.5, 15.5, 15.5, 15.0, 15.0, 15.0, 14.5, 14.5, 14.0, 14.0, 14.0, 13.5, 13.0, 13.0, 13.0, 12.0, 11.0.
Now, let $x$ denote distance measured along the length of the plot and let $f(x)$
denote the width of the plot at location $x$, with all distances measured in feet. The
area of the plot is then given by

$$\int_0^{26} f(x)\,dx.$$

To determine the number of bags of pine bark mulch Doc B must purchase, we will 
therefore have to approximate the value of this definite integral using the discrete 
data from Figure 6.1. 


Tabulating the Error Function 


There are many special mathematical functions that are defined in terms of definite 
integrals. One of these is the so-called error function, which is given by 


$$\operatorname{erf}(x) = \frac{2}{\sqrt{\pi}} \int_0^x e^{-t^2}\,dt.$$


This function has applications in probability, statistics, heat conduction, boundary 
layer theory, groundwater flow, and so on. 
Suppose we want to tabulate values of $\operatorname{erf}(x)$ and its derivative,

$$\frac{d}{dx}\operatorname{erf}(x) = \frac{2}{\sqrt{\pi}}\, e^{-x^2},$$

at equally spaced points from $x = 0$ to $x = 2$. All values are to be accurate to five
decimal places, and the increment between $x$-values, $\Delta x$, must be chosen so that
Hermite cubic interpolation from the table produces an error less than $5 \times 10^{-6}$.
Now, the error introduced by Hermite cubic interpolation is bounded above by

$$\frac{1}{384} (\Delta x)^4 \max_{0 \le x \le 2} \left| \frac{d^4}{dx^4} \operatorname{erf}(x) \right|.$$

Since

$$\max_{0 \le x \le 2} \left| \frac{d^4}{dx^4} \operatorname{erf}(x) \right| < 5,$$

the requirement that the interpolation error be less than $5 \times 10^{-6}$ translates into
needing $\Delta x < 0.14$. Let’s choose $\Delta x = 0.1$ for convenience.
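
The step-size requirement can be checked with a few lines of arithmetic. The following Python sketch solves the error bound for the largest admissible spacing; the variable names are illustrative, and the value 5 for the fourth-derivative bound is simply the one quoted in the text.

```python
# Largest table spacing dx for which the Hermite cubic interpolation
# error bound (1/384) * dx**4 * max|d^4 erf/dx^4| stays below the tolerance.
tolerance = 5e-6         # required interpolation error (five decimal places)
fourth_deriv_bound = 5   # bound on |d^4 erf/dx^4| over [0, 2] quoted in the text

dx_max = (384 * tolerance / fourth_deriv_bound) ** 0.25
bound_at_chosen_dx = fourth_deriv_bound * 0.1 ** 4 / 384

print(f"dx must be below {dx_max:.4f}")
print(f"error bound for dx = 0.1: {bound_at_chosen_dx:.2e}")
```

With the spacing 0.1 the bound falls comfortably below the tolerance, which is why the rounder value is an acceptable choice.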

Evaluating the derivative of the error function at the required $x$-values is
straightforward. To complete the table, however, we must evaluate

$$\operatorname{erf}(x) = \frac{2}{\sqrt{\pi}} \int_0^x e^{-t^2}\,dt$$

for $x$ ranging from 0 to 2 in increments of 0.1, guaranteeing that each value is
accurate to five decimal places.
The Remainder of the Chapter


The first three sections focus on the techniques for numerical differentiation. In Sec- 
tion 3 the important concept of extrapolation—combining different approximations 
obtained from the same low-order formula to obtain a higher-order approximation— 
is introduced. Our study of numerical integration begins in Section 4, where 
all of the important definitions are provided. This section also includes a dis- 
cussion of Newton-Cotes quadrature. The following sections consider composite 
Newton-Cotes quadrature, Gaussian quadrature, Romberg integration, and adap- 
tive quadrature. The chapter concludes with a section dealing with improper inte- 
grals and the proper handling of singularities. 


6.1 NUMERICAL DIFFERENTIATION, PART I 


Numerical differentiation normally arises in one of two contexts. In the first, the 
objective is to approximate the value of a derivative of a function defined by dis- 
crete data. For example, we may want to approximate the endpoint derivative 
values needed to construct a clamped cubic spline. In the second context, the 
objective is to derive formulas which approximate the derivatives of a function in 
terms of a linear combination of function values. These formulas form the basis for 
finite difference techniques for the solution of boundary value problems and partial 
differential equations which will be considered in Chapters 8 through 11. 


In this section the problem of approximating the value of a derivative from dis- 
crete data will be considered. As you read through this section, note that the main 
concern is assessing the reasonableness of the computed approximation. Deriving 
formulas for derivatives will be treated in the next section. 


Estimating a Coefficient of Friction 


Let’s begin with the problem of estimating the coefficient of friction between a 
flexible rope and the post around which it is wrapped from the experimental data 
given below, which measures the force required to overcome a 5-lb restraining force 
as a function of the angle through which the rope is wrapped around the post. 
Recall from the Chapter 1 Overview (see page 5) that the coefficient of friction, $\mu$,
is the proportionality constant between the rate of change of the force and the
magnitude of the force; that is,

$$\mu = \frac{dF/d\theta}{F}.$$

Given the data in Table 6.1, how might we approximate the value of $dF/d\theta$? Considering
the material we just covered in Chapter 5, the most appropriate choice
might be to pass an interpolating polynomial through the data and then differentiate
the polynomial.
Figure 6.2 Polynomial fit to force data shown in Table 6.1. Experimental data are represented by circles.


θ      0     π/2   π      3π/2   2π     5π/2   3π     7π/2    4π      9π/2    5π
F(θ)   5.00  7.83  12.27  19.22  30.10  47.15  73.86  115.70  181.24  283.90  444.71


TABLE 6.1: Force Data


Using all eleven data points, we obtain the interpolating polynomial

$$\begin{aligned}
P_{10}(\theta) ={}& 5 + 4.634349191\left(\frac{\theta}{\pi}\right) + 1.180101675\left(\frac{\theta}{\pi}\right)^2 + 2.401095921\left(\frac{\theta}{\pi}\right)^3 \\
&- 1.867465364\left(\frac{\theta}{\pi}\right)^4 + 1.331981301\left(\frac{\theta}{\pi}\right)^5 - 0.5264888061\left(\frac{\theta}{\pi}\right)^6 \\
&+ 0.1357248487\left(\frac{\theta}{\pi}\right)^7 - 0.02107936082\left(\frac{\theta}{\pi}\right)^8 \\
&+ 0.001848324094\left(\frac{\theta}{\pi}\right)^9 - 0.00006772484988\left(\frac{\theta}{\pi}\right)^{10}.
\end{aligned}$$

Figure 6.2 shows that this polynomial provides a reasonable fit to the data over the
entire range of $\theta$ values. Evaluating $P_{10}(\theta)$ and its derivative at $\theta = 5\pi/2$ provides
the estimate $\mu \approx 0.29$. This result has been rounded to two decimal places, the
same precision as the experimental data. Figure 6.3 shows that, with the exception
of a small region near $\pi/4$, the same two decimal place estimate would have been
obtained throughout the domain.
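
The estimate at the middle data point can be reproduced without writing out the polynomial coefficients. The sketch below is not the procedure used in the text: it differentiates the Lagrange form of the interpolating polynomial directly at a node, using barycentric weights, which is an assumed but mathematically equivalent route. The function name `derivative_at_node` is hypothetical.

```python
import math

def derivative_at_node(xs, fs, k):
    """Derivative at xs[k] of the polynomial interpolating (xs, fs)."""
    n = len(xs)
    # Barycentric weights w_j = 1 / prod_{m != j} (x_j - x_m).
    w = []
    for j in range(n):
        prod = 1.0
        for m in range(n):
            if m != j:
                prod *= xs[j] - xs[m]
        w.append(1.0 / prod)
    # Row k of the differentiation matrix: l_j'(x_k) for each basis polynomial.
    row = [0.0] * n
    for j in range(n):
        if j != k:
            row[j] = (w[j] / w[k]) / (xs[k] - xs[j])
    row[k] = -sum(row)  # derivatives of the basis polynomials sum to zero
    return sum(d * f for d, f in zip(row, fs))

# Force data from Table 6.1.
theta = [j * math.pi / 2 for j in range(11)]
force = [5.00, 7.83, 12.27, 19.22, 30.10, 47.15,
         73.86, 115.70, 181.24, 283.90, 444.71]

k = 5  # index of theta = 5*pi/2
mu = derivative_at_node(theta, force, k) / force[k]
print(f"estimated coefficient of friction: {mu:.2f}")
```

Because the derivative is evaluated at a data point, the denominator is just the tabulated force value at that point.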
Figure 6.3 Approximate coefficient of friction obtained from $P_{10}'(\theta)/P_{10}(\theta)$.


N   0.0521  0.1028  0.2036  0.4946  0.9863  2.443  5.06
D   1.65    2.10    2.27    2.76    3.12    2.92   2.07


TABLE 6.2: Diffusivity of Copper Compounds


Diffusivity of Copper Compounds 


Table 6.2 contains values of the diffusivity, D, of copper compounds from ion- 
exchange resins for various values of normality, N. A scientist needs to use this 
data to determine the normality that gives rise to the maximum diffusivity and the 
corresponding maximum diffusivity value. 

Unlike our previous example, using all of the data points to determine an
interpolating polynomial in this example provides rather miserable results (see
Figure 6.4). An examination of the data suggests that the diffusivity will achieve its
maximum somewhere in the range spanned by $N = 0.4946$, $N = 0.9863$, and $N = 2.443$.
Interpolating these three values produces the polynomial

$$P(N) = -0.4462381347N^2 + 1.392987806N + 2.180191091.$$

This downward opening parabola achieves its maximum value of 3.27 at $N = 1.5608$.
Although this polynomial provides plausible results, note $P(0.0521) = 2.2516$ and
$P(5.06) = -2.1966$, which implies that $P(N)$ does not accurately reflect the curvature
of the data.
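
The three-point calculation is short enough to check by machine. This is a minimal sketch that builds the parabola from Newton divided differences and locates its vertex; the helper name `quadratic_through` is an illustrative choice, not something defined in the text.

```python
def quadratic_through(p0, p1, p2):
    """Return (a, b, c) of the parabola a*N**2 + b*N + c through three points."""
    (x0, y0), (x1, y1), (x2, y2) = p0, p1, p2
    f01 = (y1 - y0) / (x1 - x0)    # first divided differences
    f12 = (y2 - y1) / (x2 - x1)
    a = (f12 - f01) / (x2 - x0)    # second divided difference
    b = f01 - a * (x0 + x1)
    c = y0 - f01 * x0 + a * x0 * x1
    return a, b, c

# The three diffusivity points from Table 6.2 that bracket the maximum.
a, b, c = quadratic_through((0.4946, 2.76), (0.9863, 3.12), (2.443, 2.92))

n_max = -b / (2 * a)           # vertex of the parabola
d_max = c - b * b / (4 * a)    # value of the parabola at the vertex

print(f"P(N) = {a:.10f} N^2 + {b:.9f} N + {c:.9f}")
print(f"maximum {d_max:.3f} at N = {n_max:.4f}")
```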

To obtain a more accurate estimate of the maximum diffusivity, we turn to
spline interpolation. Figure 6.5 displays the not-a-knot cubic spline interpolant for
the diffusivity data. There are two observations that we can make from this figure.
First, the behavior of the cubic spline is consistent with the experimental data over
the entire domain. Second, the maximum diffusivity appears to occur just to the
right of $N = 0.9863$. The cubic polynomial associated with that portion of the
spline is

$$s(N) = 2.780284551 + 0.6109023877N - 0.2996338780N^2 + 0.02987363472N^3,$$

whose maximum is 3.134 at $N = 1.2549$.

Figure 6.4 Interpolating polynomial for diffusivity data given in Table 6.2. Experimental values are denoted by circles.

Figure 6.5 Not-a-knot cubic spline interpolant for diffusivity data given in Table 6.2. Experimental data are denoted by circles.

Figure 6.6 Not-a-knot cubic spline interpolant for partial pressure of water data. Experimental data are denoted by circles.


Convection Mass Transfer Coefficient 


Experimentally determined values for the partial pressure of water vapor, $p_A$, as a
function of distance, $y$, from the surface of a pan of water are given below.


y (mm)       0      1      2      3      4      5
p_A (atm)    0.100  0.065  0.042  0.029  0.022  0.020


Approximating the water vapor as an ideal gas, the convection mass transfer coefficient
is related to the derivative of the partial pressure at the surface of the water.
Estimate the value of this derivative from the experimental data.

Figure 6.6 displays the not-a-knot cubic spline interpolant for the partial
pressure data. The portion of the spline that applies when $y = 0$ is

$$0.1 - 0.04142222222y + 0.00663333333y^2 - 0.00021111111y^3,$$

from which it follows that

$$\left.\frac{dp_A}{dy}\right|_{y=0} = -0.04142222222 \text{ atm/mm}.$$


EXERCISES


1. Rework the coefficient of friction problem from the data in Table 6.1 using a
not-a-knot cubic spline interpolant rather than a 10th-degree interpolating polynomial.


2. Rework the convection mass transfer coefficient problem using a single interpolating
polynomial of degree at most 5.


3. Estimate the temperature, $T$, at which the sound speed, $a$, of water is a maximum.
What is the corresponding maximum speed of sound in water?


T (°C)    0     10    20    30    40    50    60    70    80    90    100
a (m/s)   1402  1447  1482  1509  1529  1542  1551  1553  1554  1550  1543


4. The following table provides the height of water in a container as a function of
time during an experiment dealing with Torricelli’s law. Estimate the rate at
which the height of water is changing at $t = 90$ seconds.

time (sec)        0.0  13.2  29.4  44.6  61.8  80.1  99.8  121.5  148.3  174.9
height (inches)   5.5  5.0   4.5   4.0   3.5   3.0   2.5   2.0    1.5    1.0


5. The thermal resistance, $R$, as a function of insulation thickness for a thin-walled
copper tube is provided in the table below. Estimate the insulation thickness
that corresponds to minimum thermal resistance.


thickness (mm)                  0     2     5     10    20    40
thermal resistance R (m·K/W)    6.37  5.52  5.18  5.30  5.93  7.06


6. The specific heat at constant pressure, $c_p$, is given by

$$c_p = \left(\frac{\partial h}{\partial T}\right)_p,$$

where $h$ denotes enthalpy and $T$ denotes temperature. The parentheses around
the partial derivative are used to indicate that the pressure, $p$, is to be held
constant during this calculation.

The enthalpy of superheated nitrogen as a function of temperature is given in
the table below. Use these data to estimate the specific heat at constant pressure
of superheated nitrogen at a temperature of 200 K. Is the specific heat at constant
pressure of superheated nitrogen constant over the range of temperatures 150 K
through 250 K? If not, by how much does it vary?


T (K)       100      125      150      175      200
h (kJ/kg)   101.965  128.505  154.779  180.935  207.029


T (K)       225      250      275      300
h (kJ/kg)   233.085  259.122  285.144  311.158
7. In optical microlithography one of the most important performance metrics is
the sidewall angle of the photoresist film at the completion of the development
phase. Sidewall angle is a function of many different input parameters, including
exposure energy, development time, thickness of contrast enhancing film, and
numerical aperture. Sensitivity of sidewall angle to any one of these input parameters
is measured by what is known as process latitude. Let $\theta$ denote the sidewall angle
and $u$ denote one of the input parameters. The process latitude with respect to $u$
is given by


where $u_0$ is known as the nominal value of the input parameter, and all other
input parameters are assumed held fixed in the computation of the derivative.

(a) Sidewall angle as a function of contrast enhancing layer thickness is given in
the table below. Estimate the process latitude with respect to film thickness
at a nominal value of 0.20 μm.


film thickness (μm)   0.00  0.10  0.20  0.30  0.40
θ (degrees)           80.7  83.8  85.7  86.2  86.3


(b) Sidewall angle as a function of numerical aperture is given in the table
below. Estimate the process latitude with respect to numerical aperture at
a nominal value of 0.24.

numerical aperture   0.16  0.20  0.24  0.28  0.32
θ (degrees)          76.0  81.1  83.5  85.0  85.7


8. Sidewall angle as a function of resist bleaching rate constant is given below.
Estimate the resist bleaching rate constant which gives rise to the maximum
sidewall angle.


bleaching rate constant (cm²/mJ)   0.01  0.02  0.03  0.04  0.06  0.08
θ (degrees)                        74.4  76.3  77.5  77.5  76.8  76.1


6.2 NUMERICAL DIFFERENTIATION, PART II


The approach that we used in Section 6.1 to obtain an estimate for the value of 
the derivative—pass a polynomial through the data and differentiate the resulting 
interpolating function—can also be applied to obtain generic formulas for approx- 
imating the derivatives of a function in terms of a linear combination of function 
values. Although the formulas developed in this section can be used to solve prob- 
lems like those posed in Section 6.1, they are used primarily in the solution of 
differential equations, both ordinary and partial, via the finite difference method. 
This method will be discussed in Chapters 8 through 11. 


Two Formulas for Approximating the First Derivative 


Let’s begin by developing a formula for approximating the first derivative of an
arbitrary function, $f$, at $x = x_0$. The simplest meaningful approximation that can
be obtained for $f'(x_0)$ requires using data from one other point, say $x = x_1$. If $f$ is
interpolated through $x = x_0$ and $x = x_1$, interpolation theory guarantees that

$$f(x) = \frac{x - x_1}{x_0 - x_1} f(x_0) + \frac{x - x_0}{x_1 - x_0} f(x_1) + f[x_0, x_1, x](x - x_0)(x - x_1).$$

The Lagrange form of the interpolating polynomial was selected because it clearly
isolates the dependence of the polynomial on the function values being interpolated.
The divided difference form for the error term has been selected because it is easier
to manipulate than the derivative form.

If we now differentiate $f$ with respect to $x$, we obtain

$$f'(x) = \frac{1}{x_0 - x_1} f(x_0) + \frac{1}{x_1 - x_0} f(x_1) + f[x_0, x_1, x](2x - x_0 - x_1) + (x - x_0)(x - x_1) \frac{d}{dx} f[x_0, x_1, x].$$

Evaluating this expression at $x = x_0$ then yields, after some simplification,

$$f'(x_0) = \frac{f(x_1) - f(x_0)}{x_1 - x_0} + f[x_0, x_1, x_0](x_0 - x_1).$$

Recall that there is a connection between divided differences and derivatives; in
particular, if $f$ has two continuous derivatives, then there exists a $\xi$ between $x_0$
and $x_1$ such that

$$f[x_0, x_1, x_0] = \frac{f''(\xi)}{2}.$$

Therefore,

$$f'(x_0) = \frac{f(x_1) - f(x_0)}{x_1 - x_0} + \frac{x_0 - x_1}{2} f''(\xi). \tag{1}$$

It is typical to use uniformly, or equally, spaced points when developing numerical
differentiation formulas and to denote the spacing between points by the
parameter $h$. Substituting $x_1 = x_0 + h$ into equation (1) produces

$$f'(x_0) = \frac{f(x_0 + h) - f(x_0)}{h} - \frac{h}{2} f''(\xi), \tag{2}$$

where $x_0 < \xi < x_0 + h$, while substituting $x_1 = x_0 - h$ into (1) produces

$$f'(x_0) = \frac{f(x_0) - f(x_0 - h)}{h} + \frac{h}{2} f''(\xi), \tag{3}$$

where $x_0 - h < \xi < x_0$. The first term on the right-hand side of equation (2) and
of equation (3) represents our formula for approximating $f'(x_0)$; the second term
on the right-hand side of each equation is the error term, which shall be discussed
in more detail below. Since the approximation in (2) uses data to the right of $x_0$, it
is referred to as a FORWARD DIFFERENCE APPROXIMATION. For similar reasons,
the formula in (3) is referred to as a BACKWARD DIFFERENCE APPROXIMATION.
Alternative Derivation


Difference approximations can also be obtained through the use of Taylor’s theorem.
For example, suppose that $f$ has two continuous derivatives. Then, by Taylor’s
theorem, we may expand $f(x_0 + h)$ as

$$f(x_0 + h) = f(x_0) + h f'(x_0) + \frac{h^2}{2} f''(\xi),$$

where $x_0 < \xi < x_0 + h$. Solving this equation for $f'(x_0)$ yields the forward difference
approximation in (2). Had we started with an expansion for $f(x_0 - h)$, we would
have reproduced the backward difference approximation in equation (3).


The Error Term 


The error terms in equations (2) and (3) actually provide us with two pieces of
information. First, if we assume that $f$ is continuous with two continuous derivatives on
$[a, b]$, then the Extreme Value Theorem guarantees that $f''$ is bounded on either of
the closed intervals $[x_0, x_0 + h]$ or $[x_0 - h, x_0]$. The approximation formulas from (2)
and (3) therefore introduce an error which is proportional to the first power of the
spacing parameter, or step size, $h$. Since the error depends on the first power of $h$,
the formulas in (2) and (3) are said to provide first-order approximations to the
first derivative. From a practical standpoint, when a first-order formula is used and
the step size is cut by a factor of 2, say, we expect the error to drop by that same
factor. The following example illustrates this point.


EXAMPLE 6.1 Approximating the Derivative of the Natural Logarithm 


Consider the function $f(x) = \ln x$. The tables below display the results of approximating
the value of the derivative of the natural logarithm at $x_0 = 2$ using the
first-order formulas given by (2) and (3).


Forward difference, equation (2):

h       [f(x_0 + h) - f(x_0)]/h                            Error
1       f'(2) ≈ [ln(3.0) - ln(2.0)]/1        = 0.405465    0.094535
0.1     f'(2) ≈ [ln(2.1) - ln(2.0)]/0.1      = 0.487902    0.012098
0.01    f'(2) ≈ [ln(2.01) - ln(2.0)]/0.01    = 0.498754    1.2458 × 10^-3
0.001   f'(2) ≈ [ln(2.001) - ln(2.0)]/0.001  = 0.499875    1.2496 × 10^-4


Backward difference, equation (3):

h       [f(x_0) - f(x_0 - h)]/h                            Error
1       f'(2) ≈ [ln(2.0) - ln(1.0)]/1        = 0.693147    0.193147
0.1     f'(2) ≈ [ln(2.0) - ln(1.9)]/0.1      = 0.512933    0.012933
0.01    f'(2) ≈ [ln(2.0) - ln(1.99)]/0.01    = 0.501254    1.2542 × 10^-3
0.001   f'(2) ≈ [ln(2.0) - ln(1.999)]/0.001  = 0.500125    1.2505 × 10^-4


Note how each time the step size, $h$, is cut by a factor of 10, the corresponding
error also drops by a factor of roughly 10. This is characteristic of first-order
approximation formulas.
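
The entries in this example are easy to regenerate. The following Python sketch applies formulas (2) and (3) to the natural logarithm at 2; the function names are illustrative choices rather than notation from the text.

```python
import math

def forward_difference(f, x0, h):
    """First-order forward difference, the approximation in equation (2)."""
    return (f(x0 + h) - f(x0)) / h

def backward_difference(f, x0, h):
    """First-order backward difference, the approximation in equation (3)."""
    return (f(x0) - f(x0 - h)) / h

x0, exact = 2.0, 0.5   # d/dx ln(x) at x = 2 is 1/2
rows = []
for h in (1.0, 0.1, 0.01, 0.001):
    fwd = forward_difference(math.log, x0, h)
    bwd = backward_difference(math.log, x0, h)
    rows.append((h, fwd, abs(exact - fwd), bwd, abs(exact - bwd)))
    print(f"{h:<6g} {fwd:.6f} {abs(exact - fwd):.4e}   {bwd:.6f} {abs(exact - bwd):.4e}")
```

Dividing the error in one row by the error in the next gives a ratio close to 10, which is the numerical signature of a first-order formula.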


The second piece of information that can be obtained from the error term
is related to the order of derivative that appears. In (2) and (3), each error term
contains $f''$. Since every constant and every linear function has a second derivative
that is identically zero, we are guaranteed that the formulas in (2) and (3) will
provide the exact value of the first derivative of every constant and every linear
function, regardless of the step size used. In general, if the error term involves
$f^{(n)}$, then the corresponding approximation formula will provide exact results for
all polynomial functions of degree up to and including $n - 1$.


EXAMPLE 6.2 Formulas in (2) and (3) Are Exact for Constant and Linear 
Functions 


Due to the linearity of both the derivative and our difference approximation
formulas, we need only consider the functions $f(x) = 1$ and $f(x) = x$. The row for
$f(x) = x^2$ shows that exactness is lost at degree two.


f(x)          f'(x_0)    [f(x_0 + h) - f(x_0)]/h    [f(x_0) - f(x_0 - h)]/h

f(x) = 1      0          0                          0
f(x) = x      1          1                          1
f(x) = x^2    2x_0       2x_0 + h                   2x_0 - h


A Higher-Order Approximation for the First Derivative 


Using the function values $f(x_0)$ and $f(x_1)$, the best we can do is generate a first-order
approximation to $f'(x_0)$. To obtain a second-order approximation to the
first derivative at $x = x_0$ (that is, one for which the error term involves $h^2$), the
function value at another point, say $x_2$, must be included. There are two choices
for the placement of $x_2$ relative to $x_0$ and $x_1$: place $x_2$ on the same side of $x_0$ as
$x_1$ or place $x_2$ on the opposite side.

First, suppose we have been using $x_0$ and $x_1 = x_0 + h$ and place $x_2$ at $x_0 + 2h$.
From interpolation theory, we know

$$f(x) = \frac{(x - x_1)(x - x_2)}{2h^2} f(x_0) - \frac{(x - x_0)(x - x_2)}{h^2} f(x_1) + \frac{(x - x_0)(x - x_1)}{2h^2} f(x_2) + f[x_0, x_1, x_2, x](x - x_0)(x - x_1)(x - x_2).$$

Differentiating this expression with respect to $x$ and evaluating at $x = x_0$ produces

$$f'(x_0) = \frac{-3f(x_0) + 4f(x_0 + h) - f(x_0 + 2h)}{2h} + \frac{h^2}{3} f'''(\xi) \tag{4}$$

for some $\xi$ between $x_0$ and $x_0 + 2h$. The details of this derivation are left as an
exercise. Since only values of $f$ to the right of $x_0$ have been used and an error term
that depends on the second power of $h$ has been obtained, this is a SECOND-ORDER
FORWARD DIFFERENCE APPROXIMATION to the first derivative. If $h$ is replaced
by $-h$ in (4), the SECOND-ORDER BACKWARD DIFFERENCE APPROXIMATION

$$f'(x_0) = \frac{3f(x_0) - 4f(x_0 - h) + f(x_0 - 2h)}{2h} + \frac{h^2}{3} f'''(\xi) \tag{5}$$

is obtained.

Now suppose that $x_2$ is placed on the opposite side of $x_0$, and hence data
from $x_0 - h$, $x_0$, and $x_0 + h$ are used to develop an approximation. If the standard
procedure is followed (interpolate, differentiate with respect to $x$, and then evaluate
at $x = x_0$), we obtain

$$f'(x_0) = \frac{f(x_0 + h) - f(x_0 - h)}{2h} - \frac{h^2}{6} f'''(\xi). \tag{6}$$

Since this formula uses data from either side of $x_0$, it is referred to as a central
difference approximation. In particular, it is the SECOND-ORDER CENTRAL DIFFERENCE
APPROXIMATION to the first derivative.


EXAMPLE 6.3 Verifying Second-Order Approximation for Formulas (4), 
(5), and (6) 


Consider the function $f(x) = \ln x$. The tables below display the results of approximating
the value of the derivative of the natural logarithm at $x_0 = 2$ using the
second-order formulas given by (4), (5), and (6).


h       Equation (4)         Error             Equation (5)         Error

0.1     f'(2) ≈ 0.499252     7.4762 × 10^-4    f'(2) ≈ 0.499063     9.3669 × 10^-4
0.01    f'(2) ≈ 0.499992     8.2405 × 10^-6    f'(2) ≈ 0.499992     8.4280 × 10^-6
0.001   f'(2) ≈ 0.500000     8.3245 × 10^-8    f'(2) ≈ 0.500000     8.3430 × 10^-8


h       Equation (6)         Error

0.1     f'(2) ≈ 0.500417     4.1729 × 10^-4
0.01    f'(2) ≈ 0.500004     4.1667 × 10^-6
0.001   f'(2) ≈ 0.500000     4.1666 × 10^-8

Note how each time the step size, $h$, is cut by a factor of 10, the corresponding
error drops by a factor of 100. This is characteristic of second-order approximation
formulas.
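
The same experiment can be run for the three second-order formulas. This is a minimal sketch, with illustrative function names, that evaluates the approximations in equations (4), (5), and (6) for the natural logarithm at 2 and records the errors.

```python
import math

def forward2(f, x0, h):
    """Second-order forward difference, equation (4)."""
    return (-3 * f(x0) + 4 * f(x0 + h) - f(x0 + 2 * h)) / (2 * h)

def backward2(f, x0, h):
    """Second-order backward difference, equation (5)."""
    return (3 * f(x0) - 4 * f(x0 - h) + f(x0 - 2 * h)) / (2 * h)

def central2(f, x0, h):
    """Second-order central difference, equation (6)."""
    return (f(x0 + h) - f(x0 - h)) / (2 * h)

x0, exact = 2.0, 0.5
formulas = {"(4)": forward2, "(5)": backward2, "(6)": central2}
errors = {}
for h in (0.1, 0.01, 0.001):
    for label, formula in formulas.items():
        errors[(label, h)] = abs(formula(math.log, x0, h) - exact)
    print(h, [f"{errors[(label, h)]:.4e}" for label in formulas])
```

Comparing the errors at successive step sizes shows the factor-of-100 reduction, and comparing across formulas at a fixed step shows the central difference error to be about half the others.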


Although the formulas in equations (4), (5), and (6) are all second-order accurate,
note that for each fixed value of $h$, the central difference formula produces an
error which is roughly half that of the other formulas. This should not be surprising
upon closer examination of the error terms: the coefficient in (6) is half the coefficient
in the other equations. Furthermore, the central difference formula required
only two function evaluations, as opposed to three function evaluations for the forward
and backward difference formulas. As seen in equations (2) and (3), using two
function evaluations with a forward or backward difference formula achieves only
first-order accuracy. In general, central difference formulas require fewer function
evaluations to achieve a given order of accuracy. For this reason, central difference
formulas are used most often in practice.


A Formula for the Second Derivative 


The simplest meaningful approximation to the second derivative requires the use
of three points. The central difference formula will be developed here. Derivations
of the forward and backward difference formulas will be left for exercises. If $f$ is
interpolated at $x_0 - h$, $x_0$, and $x_0 + h$ and the resulting polynomial is differentiated
twice with respect to $x$ and then evaluated at $x = x_0$, we obtain

$$f''(x_0) = \frac{f(x_0 + h) - 2f(x_0) + f(x_0 - h)}{h^2} - 2h^2 \left.\frac{d}{dx} f[x_0 - h, x_0, x_0 + h, x]\right|_{x = x_0}. \tag{7}$$

Next, using the definition of the derivative and the definition of divided differences,
we find that

$$\begin{aligned}
\left.\frac{d}{dx} f[x_0 - h, x_0, x_0 + h, x]\right|_{x = x_0}
&= \lim_{\Delta \to 0} \frac{f[x_0 - h, x_0, x_0 + h, x_0 + \Delta] - f[x_0 - h, x_0, x_0 + h, x_0]}{\Delta} \\
&= \lim_{\Delta \to 0} f[x_0 - h, x_0, x_0 + h, x_0 + \Delta, x_0] \\
&= f[x_0 - h, x_0, x_0 + h, x_0, x_0].
\end{aligned}$$

Finally, substituting this result into (7) and replacing the divided difference by
$\frac{1}{24} f^{(4)}(\xi)$, where $x_0 - h < \xi < x_0 + h$, yields the SECOND-ORDER CENTRAL
DIFFERENCE APPROXIMATION for the second derivative:

$$f''(x_0) = \frac{f(x_0 + h) - 2f(x_0) + f(x_0 - h)}{h^2} - \frac{h^2}{12} f^{(4)}(\xi). \tag{8}$$
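
As a quick numerical check of the second-order behavior, the sketch below applies the central difference formula for the second derivative to a test function. The choice of $f(x) = \sin x$ at $x_0 = 1$ is an assumption made purely for illustration, as is the function name.

```python
import math

def second_derivative_central(f, x0, h):
    """Central difference approximation to f''(x0) using three points."""
    return (f(x0 + h) - 2 * f(x0) + f(x0 - h)) / h ** 2

x0 = 1.0
exact = -math.sin(x0)   # second derivative of sin(x)
errors = {}
for h in (0.1, 0.01):
    errors[h] = abs(second_derivative_central(math.sin, x0, h) - exact)
    # The error term predicts roughly (h^2 / 12) * |f''''|, with f'''' = sin here.
    predicted = h ** 2 / 12 * math.sin(x0)
    print(f"h = {h:<5g} error = {errors[h]:.4e}   predicted = {predicted:.4e}")
```

Reducing the step by a factor of 10 should reduce the error by a factor close to 100, provided the step is not so small that roundoff dominates.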


The Effect of Roundoff Errors 


Considering our discussion of the propagation of roundoff error in Chapter 1, it
should not be surprising that numerical differentiation is unstable with respect
to roundoff errors. After all, the formulas we just finished deriving involve the
subtraction of nearly equal numbers followed by a division by a small number.
To demonstrate this idea, consider the $O(h^2)$ central difference approximation
for $f'$:

$$f'(x_0) = \frac{f(x_0 + h) - f(x_0 - h)}{2h} - \frac{h^2}{6} f'''(\xi).$$

When the values of $f$ are entered into the computer, a roundoff error will be introduced,
so that our calculations will actually be made with $\tilde{f}(x_0 + h)$ and $\tilde{f}(x_0 - h)$,
where

$$f(x_0 + h) = \tilde{f}(x_0 + h) + e(x_0 + h),$$

$$f(x_0 - h) = \tilde{f}(x_0 - h) + e(x_0 - h),$$

and $e(x_0 + h)$ and $e(x_0 - h)$ are the respective roundoff errors. Substituting these
values into the central difference formula leads to

$$f'(x_0) = \frac{\tilde{f}(x_0 + h) - \tilde{f}(x_0 - h)}{2h} + \frac{e(x_0 + h) - e(x_0 - h)}{2h} - \frac{h^2}{6} f'''(\xi).$$

If we assume that $|e(x_0 \pm h)| < \epsilon$, then

$$\left| f'(x_0) - \frac{\tilde{f}(x_0 + h) - \tilde{f}(x_0 - h)}{2h} \right| \le \frac{\epsilon}{h} + \frac{h^2 M}{6},$$

where $M = \max_{a \le x \le b} |f'''(x)| = \|f'''\|_\infty$. Note that as $h \to 0$, the truncation error
term $h^2 M/6 \to 0$, but the roundoff error term $\epsilon/h \to \infty$ (see Figure 6.7).

Figure 6.7 Error bound for second-order central difference approximation to the first derivative as a function of step size, when roundoff error is taken into account.
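
The competition between the two terms suggests a rough guide to the best step size. As a sketch, treating $\epsilon$ and $M$ as known constants (an idealization, since neither is known exactly in practice), minimize the bound $E(h)$ with respect to $h$:

$$E(h) = \frac{\epsilon}{h} + \frac{h^2 M}{6}, \qquad E'(h) = -\frac{\epsilon}{h^2} + \frac{hM}{3} = 0 \quad\Longrightarrow\quad h^* = \left(\frac{3\epsilon}{M}\right)^{1/3}, \qquad E(h^*) = \frac{3\epsilon}{2h^*}.$$

For instance, with the assumed values $\epsilon \approx 10^{-16}$ and $M \approx 1$, this gives $h^* \approx 6.7 \times 10^{-6}$. Reducing the step size below this level increases the bound rather than decreasing it.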


EXAMPLE 6.4 Instability of Numerical Differentiation 
Consider approximating $f'(0)$ for $f(x) = e^x$ using the second-order central difference
formula

$$f'(x_0) \approx \frac{f(x_0 + h) - f(x_0 - h)}{2h}.$$

Working in double precision, we find that for step sizes from $10^{-1}$ down to $10^{-4}$,
the second-order character of the approximation formula is evident. With each reduction
by a factor of 10 in the step size, the total error drops by a factor of 100. As
the step size is reduced further, the total error transitions from the truncation error
dominated domain into the roundoff error dominated domain, and the accuracy of
the approximation deteriorates rapidly.
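
The experiment in this example can be repeated with a short loop. The following Python sketch assumes IEEE double precision arithmetic and uses an illustrative function name; the digits printed in the roundoff-dominated range may vary slightly from one platform or math library to another.

```python
import math

def central_difference(f, x0, h):
    """Second-order central difference approximation to f'(x0)."""
    return (f(x0 + h) - f(x0 - h)) / (2 * h)

exact = 1.0   # f'(0) for f(x) = e^x
errors = {}
for k in range(1, 17):
    h = 10.0 ** (-k)
    errors[k] = abs(central_difference(math.exp, 0.0, h) - exact)
    print(f"h = 1e-{k:02d}   error = {errors[k]:.2e}")
```

The error first falls by about a factor of 100 per step-size reduction and then grows again once the subtraction in the numerator loses most of its significant digits.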


Step Size   Error             Step Size   Error
10^-1       1.67 × 10^-3      10^-9       2.72 × 10^-8
10^-2       1.67 × 10^-5      10^-10      8.27 × 10^-8
10^-3       1.67 × 10^-7      10^-11      8.27 × 10^-8
10^-4       1.67 × 10^-9      10^-12      3.34 × 10^-5
10^-5       1.21 × 10^-11     10^-13      2.44 × 10^-4
10^-6       2.68 × 10^-11     10^-14      8.00 × 10^-4
10^-7       5.26 × 10^-10     10^-15      5.47 × 10^-2
10^-8       6.08 × 10^-9      10^-16      0.445


EXERCISES


1. Derive the second-order central difference approximation for the first derivative,
including error term:

$$f'(x_0) = \frac{f(x_0 + h) - f(x_0 - h)}{2h} - \frac{h^2}{6} f'''(\xi).$$

2. Derive equation (4).

3. Derive equation (7).

4. (a) Derive the following difference approximation for the first derivative:

$$f'(x_0) \approx \frac{f(x_0 + 2h) - f(x_0 - h)}{3h}.$$

(b) What is the error term associated with this formula?
(c) Numerically verify the order of approximation using $f(x) = \ln x$ and $x_0 = 2$.

5. (a) Derive the following forward difference approximation for the second derivative:

$$f''(x_0) \approx \frac{f(x_0) - 2f(x_0 + h) + f(x_0 + 2h)}{h^2}.$$

(b) What is the error term associated with this formula?
(c) Numerically verify the order of approximation using $f(x) = e^x$ and $x_0 = 0$.

6. (a) Derive the following backward difference approximation for the second derivative:

$$f''(x_0) \approx \frac{f(x_0 - 2h) - 2f(x_0 - h) + f(x_0)}{h^2}.$$

(b) What is the error term associated with this formula?
(c) Numerically verify the order of approximation using $f(x) = \ln x$ and $x_0 = 2$.


. (a) Derive a formula for approximating the first derivative of an arbitrary func- 


tion at x = xo using four equally spaced points, with two (2) of those points 
to the left and one (1) to the nght of z = a9. 

(hb) What is the order of approximation for the formula obtained in part (a)? 
Completely justify your response. 


8. (a) Derive a formula for approximating the first derivative of an arbitrary function
at $x = x_0$ by interpolating at $x = x_0 + h$ and $x = x_0 - \alpha h$ for $\alpha > 0$.
(b) Show, analytically, that the formula from part (a) is second order when
$\alpha = 1$, but only first order for $\alpha \neq 1$.


9. (a) Derive a formula for approximating the second derivative of an arbitrary
function at $x = x_0$ by interpolating at $x = x_0 + h$, $x = x_0$, and $x = x_0 - \alpha h$
for $\alpha > 0$.

(b) Show, analytically, that the formula from part (a) is second order when
$\alpha = 1$, but only first order for $\alpha \neq 1$.


10. (a) Using $f(x) = \ln x$ and $x_0 = 2$, demonstrate numerically that the central
difference approximation for the second derivative given by

\[ f''(x_0) \approx \frac{f(x_0 - h) - 2f(x_0) + f(x_0 + h)}{h^2} \]

is second order accurate.
(b) Repeat part (a) using $f(x) = e^x$ and $x_0 = 0$.
11. Verify that each of the following difference approximations for the first derivative
provides the exact value of the derivative, regardless of $h$, for the functions
$f(x) = 1$, $f(x) = x$, and $f(x) = x^2$, but not for the function $f(x) = x^3$.

(a) $f'(x_0) \approx \dfrac{-3f(x_0) + 4f(x_0 + h) - f(x_0 + 2h)}{2h}$

(b) $f'(x_0) \approx \dfrac{3f(x_0) - 4f(x_0 - h) + f(x_0 - 2h)}{2h}$

(c) $f'(x_0) \approx \dfrac{f(x_0 + h) - f(x_0 - h)}{2h}$

12. Verify that the second-order central difference approximation for the second
derivative provides the exact value of the second derivative, regardless of the
value of $h$, for the functions $f(x) = 1$, $f(x) = x$, $f(x) = x^2$, and $f(x) = x^3$, but
not for the function $f(x) = x^4$.


13. (a) Use the formula

\[ f'(x_0) \approx \frac{f(x_0 + h) - f(x_0)}{h} \]

to approximate the derivative of $f(x) = 1 + x + x^3$ at $x_0 = 1$, taking
$h = 1, 0.1, 0.01$, and $0.001$. What is the order of approximation?

(b) Repeat part (a) for $x_0 = 0$.

(c) Explain any difference between the results from part (a) and those from
part (b).

14. (a) Use the formula
to approximate the derivative of $f(x) = \sin x$ at $x_0 = \pi$, taking $h =
1, 0.1, 0.01$, and $0.001$. What is the order of approximation?
(b) Repeat part (a) for $x_0 = \pi/2$.
(c) Explain any difference between the results from part (a) and those from
part (b).
15. Consider the following formula for approximating the first derivative of an arbitrary
function:

\[ f'(x_0) = \frac{-2f(x_0 - 3h) + 9f(x_0 - 2h) - 18f(x_0 - h) + 11f(x_0)}{6h} + \frac{h^3}{4} f^{(4)}(\xi), \]

where $x_0 - 3h < \xi < x_0$.

(a) Suppose that the function values used in the above formula contain roundoff/data
errors that are bounded in absolute value by $\epsilon$ and that the absolute
value of the fourth derivative is bounded by $M$. Derive a bound for the
approximation error associated with the above formula as a function of $\epsilon$, $M$,
and $h$.

(b) Suppose $\epsilon = 5.96 \times 10^{-8}$ (machine precision in IEEE standard single
precision). Determine the value for the step size $h$ that minimizes the bound
on the error when approximating the value of the derivative of $f(x) = e^x$ at
$x_0 = 1$.

16. Consider the second-order forward difference formula for approximating the first
derivative of an arbitrary function:

\[ f'(x_0) = \frac{-3f(x_0) + 4f(x_0 + h) - f(x_0 + 2h)}{2h} + \frac{h^2}{3} f'''(\xi), \]

where $x_0 < \xi < x_0 + 2h$.

(a) Suppose that the function values used in the above formula contain roundoff/data
errors that are bounded in absolute value by $\epsilon$ and that the absolute
value of the third derivative is bounded by $M$. Derive a bound for the
approximation error associated with the above formula as a function of $\epsilon$, $M$,
and $h$.

(b) Suppose $\epsilon = 1.11 \times 10^{-16}$ (machine precision in IEEE standard double
precision). Determine the value for the step size $h$ that minimizes the bound
on the error when approximating the value of the derivative of $f(x) = \ln x$
at $x_0 = 2$.


6.3. RICHARDSON EXTRAPOLATION 


In the previous section several first- and second-order finite difference approxima- 
tion formulas for first and second derivatives were obtained. Higher-order formulas 
can of course be derived by interpolating more data points, but an alternative for 
obtaining higher-order approximations is to use a procedure known as extrapola- 
tion. The basic idea behind extrapolation is that whenever the leading term in 
the error for an approximation formula is known, we can combine two approxi- 
mations obtained from that formula using different values of the parameter h to 
obtain a higher-order approximation. This process will be illustrated in this section
for finite difference approximations to derivatives; the technique is known as
Richardson extrapolation. In a later section, we will apply extrapolation to numerical
integration formulas, and in the next chapter, we will apply extrapolation to
numerical solutions of initial value problems.


Small-O Notation 


Recall that a function $g(h)$ is said to be big-O of $h^k$ as $h \to 0$, written $g(h) = O(h^k)$,
provided there exists a positive constant $L$ such that

\[ \left| \frac{g(h)}{h^k} \right| \le L \]

for all sufficiently small $h$. In other words, $g(h) \to 0$ as $h \to 0$ at least as fast as $h^k$.
If it happens that

\[ \lim_{h \to 0} \frac{g(h)}{h^k} = 0, \]

so that $g(h) \to 0$ as $h \to 0$ faster than $h^k$, then $g(h)$ is said to be small-O of $h^k$,
written $g(h) = o(h^k)$.


Richardson Extrapolation 


Let’s consider the second-order central difference formula for $f'$:

\[ f'(x_0) = \frac{f(x_0 + h) - f(x_0 - h)}{2h} - \frac{h^2}{6} f'''(\xi). \]

For notational convenience, let $D$ denote the true value of the derivative, and let
$D_h$ denote the approximation obtained using a step size of $h$. From our derivation
in the previous section, we know that $x_0 - h < \xi < x_0 + h$; hence, the squeeze
theorem guarantees

\[ \xi \to x_0 \quad \text{as } h \to 0. \]

Therefore, as $h \to 0$ (assuming $f'''$ is continuous at $x_0$),

\[ \frac{D - D_h}{h^2} = -\frac{1}{6} f'''(\xi) \to -\frac{1}{6} f'''(x_0)
   \quad \Longrightarrow \quad D = D_h + O(h^2). \]

Let’s now look at the error term more precisely. Since

\[ -\frac{h^2}{6} f'''(\xi) = -\frac{h^2}{6} f'''(x_0) - \frac{h^2}{6} \left[ f'''(\xi) - f'''(x_0) \right], \]

we see that

\[ \lim_{h \to 0} \frac{D - D_h + \frac{h^2}{6} f'''(x_0)}{h^2}
   = \lim_{h \to 0} \left( -\frac{1}{6} \right) \left[ f'''(\xi) - f'''(x_0) \right] = 0. \]

Therefore, $D = D_h + K_1 h^2 + o(h^2)$, where $K_1 = -\frac{1}{6} f'''(x_0)$.

As stated earlier, the process of extrapolation uses two approximations computed
from the same formula, but with different values of $h$, to obtain a higher-order
approximation. For the second-order central difference approximation to $f'$,
we have just established

\[ D = D_h + K_1 h^2 + o(h^2); \]

hence,

\[ D = D_{h/2} + K_1 \left( \frac{h}{2} \right)^2 + o(h^2). \]

Upon subtracting these two expressions, we obtain

\[ 0 = D_h - D_{h/2} + \frac{3}{4} K_1 h^2 + o(h^2), \]

which can easily be solved for $K_1 h^2$, the leading term in the error of our original
approximation:

\[ K_1 h^2 = \frac{4}{3} \left[ D_{h/2} - D_h \right] + o(h^2). \]

We now substitute for $K_1 h^2$ in the original approximation to obtain

\[ D = D_h + \frac{4}{3} \left[ D_{h/2} - D_h \right] + o(h^2)
     = \frac{4 D_{h/2} - D_h}{3} + o(h^2). \]

The formula $(4 D_{h/2} - D_h)/3$ is a higher-order approximation for the first derivative in
the sense that it converges faster than $h^2$, as opposed to at the same rate as $h^2$.


EXAMPLE 6.5 Extrapolating the Derivative of the Natural Logarithm 


Let’s reconsider the approximation of the derivative of the function $f(x) = \ln x$
using the second-order central difference approximation formula. With $x_0 = 2$, we
find

    h        D_h = [f(x_0 + h) - f(x_0 - h)] / (2h)     Error
    0.1      0.500417292                                4.17 x 10^-4
    0.05     0.500104205                                1.04 x 10^-4
    0.025    0.500026043                                2.60 x 10^-5

Applying the extrapolation formula to these results produces

    h        (4 D_{h/2} - D_h) / 3     Error
    0.1      0.499999842               1.58 x 10^-7
    0.05     0.499999989               1.10 x 10^-8

Not only has the approximation error been significantly reduced as a result of
extrapolation, the cut in error due to the cut in the step size is also larger. In
particular, it appears as if the extrapolated values are fourth-order approximations:
having cut $h$ by a factor of 2, the error has dropped by a factor of roughly 16.
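
The numbers in this example can be reproduced with a few lines of code. The following Python sketch is only an illustration; the function names `central_difference` and `richardson` are our own choices and do not come from the text.

```python
import math

def central_difference(f, x0, h):
    """Second-order central difference approximation to f'(x0)."""
    return (f(x0 + h) - f(x0 - h)) / (2.0 * h)

def richardson(f, x0, h):
    """One extrapolation step, (4*D_{h/2} - D_h)/3, listed under step size h."""
    d_h = central_difference(f, x0, h)
    d_half = central_difference(f, x0, h / 2.0)
    return (4.0 * d_half - d_h) / 3.0

if __name__ == "__main__":
    x0, exact = 2.0, 0.5  # d/dx ln(x) at x = 2
    for h in (0.1, 0.05):
        d = central_difference(math.log, x0, h)
        e = richardson(math.log, x0, h)
        print(f"h={h:<5} D_h={d:.9f} err={abs(exact - d):.2e} "
              f"extrap={e:.9f} err={abs(exact - e):.2e}")
```

Because the code carries full double precision rather than nine rounded digits, the last printed digit may differ slightly from the tabulated values.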
Seeing the improvement generated in the previous example, the natural question
to ask is, Can we extrapolate again? If we know the power of $h$ in the new
leading term in the error, then the answer is yes.


Repeated Extrapolation 


In order to keep track of the various approximations that will be generated, let’s
modify our notation slightly. Let $D_h^{(1)}$ denote the original approximation, $D_h^{(2)}$
denote the first extrapolation, $D_h^{(3)}$ denote the next extrapolation, and so on.
Furthermore, we will adopt the convention that the step size associated with an
extrapolated value is the larger of the two step sizes used to calculate the extrapolated
value. For example, for the second-order central difference approximation to $f'$, we
would write

\[ D_h^{(2)} = \frac{4 D_{h/2}^{(1)} - D_h^{(1)}}{3}. \]

Provided that $f$ has five continuous derivatives, it can be shown that

\[ D = D_h^{(1)} + K_1 h^2 + K_2 h^4 + o(h^4), \]


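where $D_h^{(1)}$ still refers to the second-order central difference approximation for $f'$
and $K_1$ and $K_2$ are constants independent of $h$. To establish this formula, first
expand $f(x_0 + h)$ and $f(x_0 - h)$ in Taylor series about the point $x = x_0$:

\[ f(x_0 + h) = f(x_0) + h f'(x_0) + \frac{h^2}{2} f''(x_0) + \frac{h^3}{6} f'''(x_0)
   + \frac{h^4}{24} f^{(4)}(x_0) + \frac{h^5}{120} f^{(5)}(\xi_1) \]

\[ f(x_0 - h) = f(x_0) - h f'(x_0) + \frac{h^2}{2} f''(x_0) - \frac{h^3}{6} f'''(x_0)
   + \frac{h^4}{24} f^{(4)}(x_0) - \frac{h^5}{120} f^{(5)}(\xi_2), \]

where $x_0 < \xi_1 < x_0 + h$ and $x_0 - h < \xi_2 < x_0$. Subtracting the bottom expansion
from the top and solving for $f'(x_0)$ gives

\[ f'(x_0) = D_h^{(1)} - \frac{h^2}{6} f'''(x_0) - \frac{h^4}{240} \left[ f^{(5)}(\xi_1) + f^{(5)}(\xi_2) \right] \]
\[ \phantom{f'(x_0)} = D_h^{(1)} - \frac{h^2}{6} f'''(x_0) - \frac{h^4}{120} f^{(5)}(\xi) \]
\[ \phantom{f'(x_0)} = D_h^{(1)} - \frac{h^2}{6} f'''(x_0) - \frac{h^4}{120} f^{(5)}(x_0)
   - \frac{h^4}{120} \left[ f^{(5)}(\xi) - f^{(5)}(x_0) \right], \]

where $x_0 - h < \xi < x_0 + h$ and the Intermediate Value Theorem has been applied
to the term involving the fifth derivative in going from the first line to the second.
Since $\xi \to x_0$ as $h \to 0$, the term in square brackets in the final line tends to zero,
so the last term is $o(h^4)$. Setting $K_1 = -f'''(x_0)/6$ and $K_2 = -f^{(5)}(x_0)/120$, we have
the desired formula:

\[ D = D_h^{(1)} + K_1 h^2 + K_2 h^4 + o(h^4). \]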
After one extrapolation, this becomes

\[ D = D_h^{(2)} + \hat{K}_1 h^4 + o(h^4), \]

which verifies our observation at the end of the last example: the extrapolated
values are fourth-order approximations. The next extrapolation proceeds in exactly
the same manner as the first. We will subtract

\[ D = D_{h/2}^{(2)} + \hat{K}_1 \left( \frac{h}{2} \right)^4 + o(h^4) \]

from

\[ D = D_h^{(2)} + \hat{K}_1 h^4 + o(h^4), \]

leaving

\[ 0 = D_h^{(2)} - D_{h/2}^{(2)} + \frac{15}{16} \hat{K}_1 h^4 + o(h^4). \]

This last expression gives

\[ \hat{K}_1 h^4 = \frac{16}{15} \left[ D_{h/2}^{(2)} - D_h^{(2)} \right] + o(h^4), \]

so that

\[ D = D_h^{(2)} + \frac{16}{15} \left[ D_{h/2}^{(2)} - D_h^{(2)} \right] + o(h^4), \]

or

\[ D_h^{(3)} = \frac{16 D_{h/2}^{(2)} - D_h^{(2)}}{15}. \]


EXAMPLE 6.6 A Second Extrapolation 


In the last example, we found

    h        D_h^{(2)} = (4 D_{h/2}^{(1)} - D_h^{(1)}) / 3     Error
    0.1      0.499999842                                       1.58 x 10^-7
    0.05     0.499999989                                       1.10 x 10^-8

Extrapolating again, we find

\[ D_{0.1}^{(3)} = \frac{16 D_{0.05}^{(2)} - D_{0.1}^{(2)}}{15} = 0.499999998. \]

The error in this final approximation is $1.20 \times 10^{-9}$.
Note that the formula for $D_h^{(3)}$ has a structure similar to the formula for $D_h^{(2)}$.
In particular, the lower order approximation with the larger step size is subtracted
from a multiple of the lower order approximation with the smaller step size. This
result is then divided by the sum of the coefficients in the numerator. The coefficient
of 16 arises from the fact that the step size was cut by a factor of 2 and the lower
order approximations were fourth-order (i.e., $16 = 2^4$). In general, if a $p$th-order
formula is extrapolated by cutting the step size by a factor of $b$, the extrapolation
formula will take the form

\[ \frac{b^p D_{h/b} - D_h}{b^p - 1}. \]

When performing repeated extrapolations, it is convenient to organize the calculations
into an extrapolation table like the following one. Listing the order of
approximation associated with each column helps to keep track of the weights needed
to compute each successive column.

    Step Size    O(h^{p_1})       O(h^{p_2})       O(h^{p_3})
    h            D_h^{(1)}
    h/2          D_{h/2}^{(1)}    D_h^{(2)}
    h/4          D_{h/4}^{(1)}    D_{h/2}^{(2)}    D_h^{(3)}


EXAMPLE 6.7 Using a Different Low-Order Approximation Formula 


The extrapolation table below displays approximations to the derivative of $f(x) =
\tan^{-1} x$ at $x_0 = 2$ starting from the first-order forward difference approximation

\[ D_h^{(1)} = \frac{f(x_0 + h) - f(x_0)}{h}. \]

It can be shown (see Exercise 9) that

\[ D = D_h^{(1)} + k_1 h + k_2 h^2 + k_3 h^3 + o(h^3), \]

where $k_1$, $k_2$, and $k_3$ are constants independent of $h$. Since each row of the
extrapolation table uses a step size one-half that from the previous row, the weights used
to produce the second, third, and fourth columns of the table are 2 and $-1$, 4/3
and $-1/3$, and 8/7 and $-1/7$, respectively.

    h        O(h)           O(h^2)         O(h^3)         O(h^4)
    1        0.141897054
    0.5      0.166282463    0.190667872
    0.25     0.181693117    0.197103771    0.199249070
    0.125    0.190440208    0.199187299    0.199881808    0.199972199

The error in the final approximation is $2.78 \times 10^{-5}$. How small would the step size
have to be to obtain the same accuracy using the original approximation formula?
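
The construction of an extrapolation table is mechanical enough to automate. The sketch below is one possible arrangement, not a prescribed implementation: the names `extrapolation_table` and `forward_difference`, and the argument `orders` (the list of powers of $h$ in the error expansion), are assumptions made for illustration. Each new column applies the general weights $b^p/(b^p - 1)$ and $-1/(b^p - 1)$ to neighbouring entries of the previous column.

```python
import math

def extrapolation_table(approx, h0, orders, rows, b=2.0):
    """Build a triangular extrapolation table.

    approx(h) -- low-order approximation computed with step size h
    orders    -- powers of h in the error expansion, e.g. [1, 2, 3]
    rows      -- number of rows; step sizes are h0, h0/b, h0/b**2, ...
    """
    table = []
    for i in range(rows):
        row = [approx(h0 / b**i)]
        # Column j+1 combines the current-row and previous-row entries of column j.
        for j in range(min(i, len(orders))):
            w = b ** orders[j]
            row.append((w * row[j] - table[i - 1][j]) / (w - 1.0))
        table.append(row)
    return table

def forward_difference(f, x0):
    """Return the map h -> (f(x0 + h) - f(x0)) / h."""
    return lambda h: (f(x0 + h) - f(x0)) / h

if __name__ == "__main__":
    tab = extrapolation_table(forward_difference(math.atan, 2.0), 1.0, [1, 2, 3], 4)
    for i, row in enumerate(tab):
        print(f"{1.0 / 2**i:<6}", "  ".join(f"{v:.9f}" for v in row))
```

Run as a script, this prints a table of the same shape as the one above, with rows for $h = 1, 0.5, 0.25, 0.125$.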


EXERCISES 


1. In the last example, extrapolation was used to obtain an approximation to the
first derivative of $f(x) = \tan^{-1} x$ at $x_0 = 2$ with an error of $2.78 \times 10^{-5}$. The
smallest step size used in the construction of the extrapolation table was $h =
0.125$. Starting approximations for the extrapolation table were obtained from
the first-order forward difference formula

\[ D_h^{(1)} = \frac{f(x_0 + h) - f(x_0)}{h}. \]

What step size would be needed in the first-order forward difference formula to
obtain the same accuracy, $2.78 \times 10^{-5}$, as the final extrapolated value?


2. In the first example, extrapolation was used to obtain an approximation to the
first derivative of $f(x) = \ln x$ at $x_0 = 2$ with an error of $1.20 \times 10^{-9}$. The
smallest step size used in the construction of the extrapolation table was $h =
0.025$. Starting approximations for the extrapolation table were obtained from
the second-order central difference formula

\[ D_h^{(1)} = \frac{f(x_0 + h) - f(x_0 - h)}{2h}. \]

What step size would be needed in the second-order central difference formula
to obtain the same accuracy, $1.20 \times 10^{-9}$, as the final extrapolated value?


In Exercises 3-7, fill in the missing values from the given extrapolation table. The order 
of approximation associated with each column is indicated above the column, and with 
each new row, h is reduced by a factor of two. 


3. 


O(h?) O(n) 
0.7398169125 
0.7187845413 ? 
0.7104251526 2 3 
O(h) O(h?) 
—0.9397248595 
—0.8555953748 ? 
—0.7887202658 ? ig 
O(n?) O(h*) O(n’) 
0.7500000000 
0.7083333333 ? 
0.6970238095 ? 0.6931746034 
? 0.6931545307 14 0.6931474775 
O(h?) O(n?) O(n*) 


1.0471975512 

? 1.1444682995 
2 ? 1,1523449594 
1.1514137785 = 1.1540323927 


1.1544141092 ? 


7.

    O(h)            O(h^2)          O(h^3)
    ?
    0.6065306597    0.8451818783
    0.7788007831    0.9510709063    ?
    ?               ?               0.9979003940    0.9995479864


8. Let $D$ denote the true derivative of a function, and let $D_h$ denote the first-order
backward difference approximation to the derivative; that is,

\[ D_h = \frac{f(x_0) - f(x_0 - h)}{h}. \]

It can be shown that

\[ D = D_h + k_1 h + k_2 h^2 + k_3 h^3 + o(h^3), \]

where $k_1$, $k_2$, and $k_3$ are constants independent of $h$. Let $f(x) = \ln(x^2 + 1)$ and
$x_0 = 1$.

(a) Starting from $h = 1$, approximate the value of the first derivative of $f$ at $x_0$
by applying extrapolation to $D_h$. Use four rows in your extrapolation table.

(b) What is the error in the final approximation?

(c) What step size would be needed in the first-order backward difference formula
to obtain the same accuracy as the final extrapolated value?


9. Assuming that $f$ has four continuous derivatives, show that

\[ D = D_h^{(1)} + k_1 h + k_2 h^2 + k_3 h^3 + o(h^3), \]

where $D$ denotes the true derivative of a function, $D_h^{(1)}$ denotes the first-order
forward difference approximation to the derivative and $k_1$, $k_2$, and $k_3$ are
constants independent of $h$. [Hint: Use Taylor’s theorem to expand $f(x_0 + h)$ about
the point $x = x_0$.]
10. (a) Show that

\[ D = D_h + k_1 h^2 + k_2 h^3 + k_3 h^4 + o(h^4), \]

where $D$ denotes the true derivative of a function, $D_h$ denotes the second-order
forward difference approximation to the derivative

\[ D_h = \frac{-3f(x_0) + 4f(x_0 + h) - f(x_0 + 2h)}{2h}, \]

and $k_1$, $k_2$, and $k_3$ are constants independent of $h$. [Hint: Use Taylor’s
theorem to expand $f(x_0 + h)$ and $f(x_0 + 2h)$ about the point $x = x_0$.]

(b) Let f(z) = 2/2? +4 and zo = —1. Starting from ) = 1, approximate the 
value of the derivative of f at xo by applying extrapolation to Dp. Use four 
rows in your extrapolation table. What is the error in the final extrapolated 
value? 


11. (a) Show that

\[ D = D_h + k_1 h^2 + k_2 h^4 + k_3 h^6 + o(h^6), \]

where $D$ denotes the true second derivative of a function, $D_h$ denotes the
second-order central difference approximation to the second derivative

\[ D_h = \frac{f(x_0 - h) - 2f(x_0) + f(x_0 + h)}{h^2}, \]

and $k_1$, $k_2$, and $k_3$ are constants independent of $h$. [Hint: Use Taylor’s theorem
to expand $f(x_0 + h)$ and $f(x_0 - h)$ about the point $x = x_0$.]

(b) Let f(z) = x%e™ and xp = 0. Starting from A = 0.5, approximate the 
value of the second derivative of f at x9 by applying extrapolation to Dp. 
Use three rows in your extrapolation table. What is the error in the final 
extrapolated value? 

12. (a) Approximate the derivative of $f(x) = 1 + x + x^3$ at $x_0 = 0$ using the first-order
forward difference formula. Take $h = 1/4$ and $h = 1/8$, and then
extrapolate from these two values.

(b) What is the error associated with each of the approximations computed in
part (a)? Explain any unusual behavior in the errors.

13. (a) Approximate the derivative of $f(x) = \sin x$ at $x_0 = \pi$ using the first-order
forward difference formula. Take $h = 1/4$ and $h = 1/8$, and then extrapolate
from these two values.

(b) What is the error associated with each of the approximations computed in
part (a)? Explain any unusual behavior in the errors.


NUMERICAL INTEGRATION—THE BASICS AND 
NEWTON-COTES QUADRATURE 


The fundamental problem of numerical integration (which is also called numerical 
quadrature) can be stated as follows: 


Given the function $f$, continuous on $[a, b]$, approximate $I(f) = \int_a^b f(x)\,dx$.


If an antiderivative of $f$ (that is, a function $F(x)$ such that $F'(x) = f(x)$)
can be found, then the Fundamental Theorem of Calculus guarantees that $I(f) =
F(b) - F(a)$. There are problems with putting this rule into practice, however.
First, there are functions $f$ for which the antiderivative cannot be expressed in
terms of standard functions. A simple example is $f(x) = e^{-x^2}$. Second, for other
functions, the antiderivative may be so complicated that it would be easier to work 
with the integrand directly. Third, functions encountered in practice are often 
defined in terms of discrete data, not a formula. These are the primary motivations 
for studying the techniques of numerical integration/quadrature. 

Most quadrature formulas take the form 


\[ I(f) \approx I_n(f) = \sum_{i=0}^{n} w_i f(x_i), \]

where the $x_i$ are known as the quadrature points, or abscissas, and the $w_i$ are called
the quadrature weights. Note that $I_n(f)$ is a linear operator, just like $I(f)$: that
is,

\[ I_n(f + g) = I_n(f) + I_n(g) \]
\[ I_n(cf) = c\,I_n(f). \]
Generally, there are two approaches which can be taken to develop numerical
quadrature formulas. In the first approach, the $x_i$ are fixed, typically as equally
spaced points from within the interval $[a, b]$, and the $w_i$ are then computed by fitting
a function to the $f(x_i)$ data and integrating the resulting function exactly. When
the chosen interpolating function is a polynomial, a NEWTON-COTES QUADRATURE
rule is obtained. In the second approach, given the number of data points $n$,
the weights and abscissas of the quadrature rule are selected to achieve maximum
possible accuracy. Formulas developed in this manner are known as GAUSSIAN
QUADRATURE rules. The basic theory of Newton-Cotes quadrature is developed in
this section and the next; Gaussian quadrature is considered in Section 6.6.


General Newton-Cotes Formulas 


The basic procedure for developing Newton-Cotes quadrature rules is to first fix
the abscissas $x_0, x_1, x_2, \ldots, x_n \in [a, b]$. Next, interpolate the integrand, $f$, at the
abscissas by the polynomial $P_n(x)$. Finally, integrate the interpolating polynomial
and set

\[ \underbrace{I(f)}_{\substack{\text{exact value} \\ \text{of integral of} \\ \text{original integrand}}}
   \approx \underbrace{I_n(f)}_{\substack{\text{Newton-Cotes} \\ \text{formula}}}
   = \underbrace{I(P_n)}_{\substack{\text{exact value} \\ \text{of integral of} \\ \text{interpolating} \\ \text{polynomial}}}. \]

Because we want the final quadrature rule to exhibit a clear dependence on the
data values, $f(x_i)$, the Lagrange form of the interpolating polynomial will be used:

\[ P_n(x) = \sum_{i=0}^{n} L_{n,i}(x) f(x_i). \]

Hence, Newton-Cotes quadrature rules will take the form

\[ I_n(f) = \int_a^b \sum_{i=0}^{n} L_{n,i}(x) f(x_i)\,dx
          = \sum_{i=0}^{n} \left( \int_a^b L_{n,i}(x)\,dx \right) f(x_i)
          = \sum_{i=0}^{n} w_i f(x_i), \]

where

\[ w_i = \int_a^b L_{n,i}(x)\,dx. \]
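
The weight formula can be evaluated exactly by machine. The Python sketch below is a hypothetical illustration (the helper names `poly_mul`, `poly_integrate`, and `newton_cotes_weights` are ours). It assumes the scaled variable $t$ defined by $x = a + t\,\Delta x$, with $\Delta x$ the spacing between equally spaced abscissas, so the returned weights are in units of $\Delta x$; rational arithmetic keeps the results exact.

```python
from fractions import Fraction

def poly_mul(p, q):
    """Multiply polynomials given as coefficient lists (lowest degree first)."""
    out = [Fraction(0)] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, c in enumerate(q):
            out[i + j] += a * c
    return out

def poly_integrate(p, lo, hi):
    """Exact integral of the polynomial p over [lo, hi]."""
    def anti(t):
        return sum(c * Fraction(t) ** (k + 1) / (k + 1) for k, c in enumerate(p))
    return anti(hi) - anti(lo)

def newton_cotes_weights(nodes, lo, hi):
    """Weights w_i = integral over [lo, hi] of the i-th Lagrange polynomial."""
    nodes = [Fraction(t) for t in nodes]
    weights = []
    for i, ti in enumerate(nodes):
        basis = [Fraction(1)]
        for j, tj in enumerate(nodes):
            if j != i:
                # Multiply by the factor (t - tj) / (ti - tj).
                basis = poly_mul(basis, [-tj / (ti - tj), 1 / (ti - tj)])
        weights.append(poly_integrate(basis, lo, hi))
    return weights

if __name__ == "__main__":
    # Three equally spaced nodes t = 0, 1, 2 that include both endpoints.
    print(newton_cotes_weights([0, 1, 2], 0, 2))
```

For three nodes $t = 0, 1, 2$ on $[0, 2]$ the function returns the weights $1/3$, $4/3$, $1/3$ (times $\Delta x$).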


Within this framework, there are two distinct varieties of Newton-Cotes formulas,
which differ in their choice of the abscissas within the interval $[a, b]$. The
so-called closed Newton-Cotes formulas include the endpoints of the integration
interval, $x = a$ and $x = b$, among the abscissas. For a given $n$, we take $\Delta x = (b - a)/n$,
and the abscissas are $x_i = a + i\,\Delta x$ for each $i = 0, 1, 2, \ldots, n$. On the other hand,
the open Newton-Cotes formulas do not include the endpoints of the integration
interval among the abscissas. For these formulas, $\Delta x = (b - a)/(n + 2)$, and the
abscissas are $x_i = a + (i + 1)\Delta x$ for each $i = 0, 1, 2, \ldots, n$.


Some Closed Newton-Cotes Formulas 


Let’s start with the simplest closed Newton-Cotes formula, which corresponds to
$n = 1$. The spacing between abscissas is $\Delta x = b - a$, and the two abscissas are
$x_0 = a$ and $x_1 = b$. Since the Lagrange polynomials associated with these points
are

\[ L_{1,0}(x) = \frac{b - x}{b - a} \quad \text{and} \quad L_{1,1}(x) = \frac{x - a}{b - a}, \]

it follows that the quadrature weights are given by

\[ w_0 = \int_a^b \frac{b - x}{b - a}\,dx \quad \text{and} \quad w_1 = \int_a^b \frac{x - a}{b - a}\,dx. \]

The substitution $x = a + t\,\Delta x$ in the integrals defining $w_0$ and $w_1$ simplifies the
final calculations:

\[ w_0 = \Delta x \int_0^1 (1 - t)\,dt = \frac{\Delta x}{2} \]

\[ w_1 = \Delta x \int_0^1 t\,dt = \frac{\Delta x}{2}. \]

Therefore, the closed Newton-Cotes quadrature formula corresponding to $n = 1$ is

\[ I(f) \approx I_{1,closed}(f) = \frac{\Delta x}{2} \left[ f(a) + f(b) \right] = \frac{b - a}{2} \left[ f(a) + f(b) \right]. \]

Geometrically, this quadrature rule approximates the value of the definite integral as
the area of a trapezoid (see Figure 6.8); hence, this rule is known as the trapezoidal
rule.

[Figure 6.8: The trapezoidal rule.]

The case $n = 2$ also produces a well-known formula. Here, $\Delta x = (b - a)/2$,
and the three abscissas are $x_0 = a$, $x_1 = a + \Delta x = (a + b)/2$ and $x_2 = a + 2\Delta x = b$.
The quadrature weights are found to be

\[ w_0 = \int_a^b L_{2,0}(x)\,dx = \frac{\Delta x}{2} \int_0^2 (t - 1)(t - 2)\,dt = \frac{\Delta x}{3} \]

\[ w_1 = \int_a^b L_{2,1}(x)\,dx = -\Delta x \int_0^2 t(t - 2)\,dt = \frac{4\Delta x}{3} \]

\[ w_2 = \int_a^b L_{2,2}(x)\,dx = \frac{\Delta x}{2} \int_0^2 t(t - 1)\,dt = \frac{\Delta x}{3}. \]

The substitution $x = a + t\,\Delta x$ was once again used in the integrals defining $w_0$, $w_1$
and $w_2$ to simplify the final calculation. Therefore,

\[ I(f) \approx I_{2,closed}(f) = \frac{\Delta x}{3} \left[ f(a) + 4f\!\left( \frac{a + b}{2} \right) + f(b) \right]
   = \frac{b - a}{6} \left[ f(a) + 4f\!\left( \frac{a + b}{2} \right) + f(b) \right]. \]

This formula may be recognized from calculus as Simpson’s rule.
The closed Newton-Cotes formulas with $n = 3$ and $n = 4$ are

\[ I(f) \approx I_{3,closed}(f) = \frac{3\Delta x}{8} \left[ f(a) + 3f(a + \Delta x) + 3f(a + 2\Delta x) + f(b) \right], \]

where $\Delta x = (b - a)/3$, and

\[ I(f) \approx I_{4,closed}(f) = \frac{2\Delta x}{45} \left[ 7f(a) + 32f(a + \Delta x) + 12f(a + 2\Delta x) + 32f(a + 3\Delta x) + 7f(b) \right], \]

where $\Delta x = (b - a)/4$. The $n = 3$ formula is also known as the three-eighths rule,
while the $n = 4$ formula is known as Boole’s rule. The derivation of these formulas
is left as an exercise.
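
As a quick illustration of how the four closed formulas are applied, here is a minimal Python sketch. The dictionary `CLOSED_WEIGHTS` and the function `closed_newton_cotes` are illustrative names of our own; the weights are stored in units of $\Delta x = (b - a)/n$, using the standard coefficients of the trapezoidal, Simpson's, three-eighths, and Boole's rules.

```python
import math

# Weights in units of dx = (b - a)/n for the closed rules n = 1, ..., 4.
CLOSED_WEIGHTS = {
    1: [1/2, 1/2],                           # trapezoidal rule
    2: [1/3, 4/3, 1/3],                      # Simpson's rule
    3: [3/8, 9/8, 9/8, 3/8],                 # three-eighths rule
    4: [14/45, 64/45, 24/45, 64/45, 14/45],  # Boole's rule
}

def closed_newton_cotes(f, a, b, n):
    """Apply the closed Newton-Cotes rule with n + 1 abscissas on [a, b]."""
    dx = (b - a) / n
    return dx * sum(w * f(a + i * dx) for i, w in enumerate(CLOSED_WEIGHTS[n]))

if __name__ == "__main__":
    exact = math.e - 1.0  # integral of e^x over [0, 1]
    for n in (1, 2, 3, 4):
        approx = closed_newton_cotes(math.exp, 0.0, 1.0, n)
        print(f"n={n}  approx={approx:.10f}  error={abs(exact - approx):.2e}")
```

For this smooth integrand the error shrinks as $n$ increases from 1 to 4.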


Some Open Newton-Cotes Formulas


The simplest open Newton-Cotes formula corresponds to $n = 0$. With $n = 0$,
$\Delta x = (b - a)/2$, and the only abscissa is $x_0 = (a + b)/2$. The quadrature weight is

\[ w_0 = \int_a^b L_{0,0}(x)\,dx = \int_a^b 1\,dx = b - a. \]

Therefore, the $n = 0$ open Newton-Cotes quadrature formula is

\[ I(f) \approx I_{0,open}(f) = (b - a) f\!\left( \frac{a + b}{2} \right). \]

This formula is known as the midpoint rule.
For $n = 1$, $\Delta x = (b - a)/3$, and the abscissas are $x_0 = a + \Delta x$ and $x_1 = a + 2\Delta x$.
The Lagrange polynomials associated with these points are

\[ L_{1,0}(x) = \frac{a + 2\Delta x - x}{\Delta x} \quad \text{and} \quad L_{1,1}(x) = \frac{x - (a + \Delta x)}{\Delta x}. \]

With these functions, the corresponding quadrature weights are

\[ w_0 = \int_a^b \frac{a + 2\Delta x - x}{\Delta x}\,dx = \Delta x \int_0^3 (2 - t)\,dt
       = \Delta x \left( 2t - \frac{1}{2} t^2 \right) \Big|_0^3 = \frac{3\Delta x}{2} \]

and

\[ w_1 = \int_a^b \frac{x - (a + \Delta x)}{\Delta x}\,dx = \Delta x \int_0^3 (t - 1)\,dt
       = \Delta x \left( \frac{1}{2} t^2 - t \right) \Big|_0^3 = \frac{3\Delta x}{2}, \]

where the change of variable $x = a + t\,\Delta x$ has been made in each integral. Putting
everything together, the open Newton-Cotes formula with $n = 1$ is

\[ I(f) \approx I_{1,open}(f) = \frac{b - a}{2} \left[ f(a + \Delta x) + f(a + 2\Delta x) \right]. \]

The open Newton-Cotes formulas with $n = 2$ and $n = 3$ are

\[ I(f) \approx I_{2,open}(f) = \frac{b - a}{3} \left[ 2f(a + \Delta x) - f(a + 2\Delta x) + 2f(a + 3\Delta x) \right], \]

where $\Delta x = (b - a)/4$, and

\[ I(f) \approx I_{3,open}(f) = \frac{b - a}{24} \left[ 11f(a + \Delta x) + f(a + 2\Delta x) + f(a + 3\Delta x) + 11f(a + 4\Delta x) \right], \]

where $\Delta x = (b - a)/5$. The derivation of these formulas is left as an exercise.


Error Analysis


Like the error term associated with a difference approximation, the error term
associated with a quadrature rule provides two pieces of information. First, the
error term indicates precisely how the error depends on the length of the integration
interval. This information will prove extremely useful in later sections. Second, the
error term allows us to determine the degree of precision, which characterizes the
class of polynomials for which the quadrature formula produces exact results.


Definition. The DEGREE OF PRECISION (or ACCURACY) of a quadrature
rule $I_n(f)$ is the positive integer $m$ such that

$I(p) = I_n(p)$ for every polynomial $p$ of degree $\le m$

$I(p) \neq I_n(p)$ for some polynomial $p$ of degree $m + 1$.


Based on this definition, a quadrature rule that integrates every constant
polynomial, every linear polynomial, and every quadratic polynomial exactly but
fails to integrate at least one cubic polynomial exactly would be said to have degree
of precision equal to 2. Though this may seem to be a cumbersome quantity to
determine, since polynomials are just linear combinations of powers of $x$ and both $I$
and $I_n$ are linear operators, we only need to consider whether the rule integrates
the powers of $x$ exactly. Therefore, if a rule integrates $1$, $x$, and $x^2$ exactly but fails
to integrate $x^3$ exactly, the degree of precision is 2.
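
The monomial test just described is easy to automate. The following Python sketch is a hypothetical helper (the name `degree_of_precision` and the cutoff `max_degree` are our own choices); it uses exact rational arithmetic so that "exactly" really means exactly, and it assumes the rule is supplied as lists of abscissas and weights.

```python
from fractions import Fraction

def degree_of_precision(nodes, weights, a, b, max_degree=20):
    """Largest m such that sum(w_i * x_i**k) equals the integral of x**k over
    [a, b] for k = 0, 1, ..., m. Returns -1 if constants are not integrated
    exactly, and max_degree if no failure is found up to that degree."""
    a, b = Fraction(a), Fraction(b)
    for k in range(max_degree + 1):
        exact = (b ** (k + 1) - a ** (k + 1)) / (k + 1)
        approx = sum(Fraction(w) * Fraction(x) ** k for x, w in zip(nodes, weights))
        if approx != exact:
            return k - 1
    return max_degree

if __name__ == "__main__":
    half = Fraction(1, 2)
    # Two-point rule on [0, 1] with both endpoints as abscissas and equal weights.
    print(degree_of_precision([0, 1], [half, half], 0, 1))
```

Only the powers of $x$ need to be tested because both the integral and the quadrature rule are linear.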

A powerful tool for deriving error terms associated with quadrature formulas
is the following theorem, known as the Weighted Mean-Value Theorem for Integrals.


Theorem. If $f$ is continuous on $[a, b]$, $g$ is integrable on $[a, b]$ and $g(x)$ does
not change sign on $[a, b]$, then there exists a number $\xi \in [a, b]$ such that

\[ \int_a^b f(x) g(x)\,dx = f(\xi) \int_a^b g(x)\,dx. \]

Proof. Suppose that $g(x) \ge 0$ on $[a, b]$. The proof for $g(x) \le 0$ is similar,
and the details are left as an exercise. Let $m$ and $M$ denote the minimum
and maximum value, respectively, achieved by $f$ on $[a, b]$. Since $g(x) \ge 0$, it
follows that

\[ m\,g(x) \le f(x) g(x) \le M\,g(x) \]

for all $x \in [a, b]$. Consequently,

\[ m \int_a^b g(x)\,dx \le \int_a^b f(x) g(x)\,dx \le M \int_a^b g(x)\,dx. \]

If $\int_a^b g(x)\,dx = 0$, then $\int_a^b f(x) g(x)\,dx$ must also equal 0, so any $\xi \in
[a, b]$ can be chosen to satisfy the requirements of the theorem. Otherwise,
$\int_a^b g(x)\,dx > 0$. Therefore,

\[ m \le \frac{\int_a^b f(x) g(x)\,dx}{\int_a^b g(x)\,dx} \le M. \]

Applying the Intermediate Value Theorem, there exists a $\xi \in [a, b]$ such that

\[ f(\xi) = \frac{\int_a^b f(x) g(x)\,dx}{\int_a^b g(x)\,dx}, \]

from which the conclusion of the theorem follows. $\square$


As a first example to demonstrate the process for determining the error associated
with a Newton-Cotes formula, let’s consider the trapezoidal rule. Starting
from interpolation theory, we know that

\[ f(x) = P_1(x) + f[a, b, x](x - a)(x - b), \]

where $P_1(x)$ is the unique linear polynomial that interpolates the integrand at $x = a$
and $x = b$. Upon integrating both sides of this last expression and applying the
definition of $I_{1,closed}(f)$, we find

\[ I(f) - I_{1,closed}(f) = \int_a^b f[a, b, x](x - a)(x - b)\,dx. \]

Observe that the function $(x - a)(x - b) \le 0$ for all $x \in [a, b]$. Thus, we can apply
the Weighted Mean-Value Theorem for Integrals to simplify the error term. The
result is

\[ I(f) - I_{1,closed}(f) = f[a, b, \hat{\xi}] \int_a^b (x - a)(x - b)\,dx
   = -\frac{(b - a)^3}{12} f''(\xi), \]

where $a < \xi < b$.

Since the second derivative of every constant and every linear polynomial is
identically zero, we see that the trapezoidal rule will integrate every constant and
every linear polynomial exactly; hence, the trapezoidal rule has degree of precision 1.
In general, when the error term for a quadrature rule involves the $n$-th
derivative of the integrand, the rule has degree of precision $n - 1$.
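
The cubic dependence of the trapezoidal rule error on the interval length can be observed numerically. The Python sketch below is an illustration under our own choices: the integrand $f(x) = e^x$ on $[0, b - a]$ and the helper names `trapezoid` and `scaled_trapezoid_error` are assumptions, not part of the text. It divides the error by $(b - a)^3$; by the error term, the quotient equals $-f''(\xi)/12$ and should therefore approach $-1/12$ as the interval shrinks.

```python
import math

def trapezoid(f, a, b):
    """Trapezoidal rule on the single interval [a, b]."""
    return 0.5 * (b - a) * (f(a) + f(b))

def scaled_trapezoid_error(length):
    """(I(f) - I_1(f)) / (b - a)**3 for f = exp on [0, length]."""
    exact = math.expm1(length)  # integral of e^x over [0, length]
    return (exact - trapezoid(math.exp, 0.0, length)) / length**3

if __name__ == "__main__":
    for length in (1.0, 0.5, 0.25, 0.125):
        print(f"b-a={length:<6} scaled error={scaled_trapezoid_error(length):.6f}")
```

Since $f''(x) = e^x$ lies between $1$ and $e^{b-a}$ on the interval, every printed value must fall between $-e^{b-a}/12$ and $-1/12$.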


EXAMPLE 6.8 Verification of Trapezoidal Rule Degree of Precision 


The following table demonstrates explicitly that the trapezoidal rule integrates $1$
and $x$ exactly, but fails to integrate $x^2$ exactly; hence, the degree of precision is 1.

    f(x)    \int_a^b f(x) dx     h[f(a) + f(a + h)]/2
    1       b - a                b - a
    x       (b^2 - a^2)/2        (b^2 - a^2)/2
    x^2     (b^3 - a^3)/3        (b^3 - a^3 + ba^2 - ab^2)/2


    f(x)    \int_a^b f(x) dx     h[f(a) + 4f(a + h) + f(a + 2h)]/3
    1       b - a                b - a
    x       (b^2 - a^2)/2        (b^2 - a^2)/2
    x^2     (b^3 - a^3)/3        (b^3 - a^3)/3
    x^3     (b^4 - a^4)/4        (b^4 - a^4)/4
    x^4     (b^5 - a^5)/5        (5b^5 - b^4 a + 2b^3 a^2 - 2b^2 a^3 + b a^4 - 5a^5)/24

TABLE 6.3: Degree of Precision of Simpson’s Rule


Because the error term associated with a quadratic interpolating polynomial
involves a third derivative, one might expect that Simpson’s rule would have degree
of precision equal to 2. The data contained in Table 6.3, however, suggests that
Simpson’s rule has degree of precision equal to 3.

The error term for Simpson’s rule therefore requires a closer look. From
interpolation theory and the derivation of Simpson’s rule, we have

\[ I(f) - I_{2,closed}(f) = \int_a^b f[a, x_1, b, x](x - a)(x - x_1)(x - b)\,dx, \]

where $x_1 = (a + b)/2$. Note that the function $(x - a)(x - x_1)(x - b)$ changes
sign on $[a, b]$, so we cannot apply the weighted mean-value theorem for integrals.
Suppose, instead, we integrate the error term by parts, taking $u = f[a, x_1, b, x]$ and
$dv = (x - a)(x - x_1)(x - b)\,dx$. Remember that with integration by parts we may
choose any antiderivative of $dv$. Here, we choose the specific antiderivative

\[ v = \int_a^x (t - a)(t - x_1)(t - b)\,dt = \frac{1}{4} (x - a)^2 (x - b)^2. \]

Then

\[ I(f) - I_{2,closed}(f) = \frac{1}{4} (x - a)^2 (x - b)^2 f[a, x_1, b, x] \Big|_a^b
   - \frac{1}{4} \int_a^b \left( \frac{d}{dx} f[a, x_1, b, x] \right) (x - a)^2 (x - b)^2\,dx \]
\[ \phantom{I(f) - I_{2,closed}(f)} = -\frac{1}{4} \int_a^b f[a, x_1, b, x, x] (x - a)^2 (x - b)^2\,dx. \]

Since $(x - a)^2 (x - b)^2 \ge 0$ for all $x \in [a, b]$, the weighted mean-value theorem for
integrals can now be applied. The end result for the Simpson’s rule error term is

\[ I(f) - I_{2,closed}(f) = -\frac{1}{4} f[a, x_1, b, \hat{\xi}, \hat{\xi}] \int_a^b (x - a)^2 (x - b)^2\,dx \]
\[ \phantom{I(f) - I_{2,closed}(f)} = -\frac{1}{4} \cdot \frac{f^{(4)}(\xi)}{24} \cdot \frac{(b - a)^5}{30}
   = -\frac{(b - a)^5}{2880} f^{(4)}(\xi), \]

where $a < \xi < b$. From here, we see that Simpson’s rule does indeed have degree
of precision equal to 3.

The error term associated with any Newton-Cotes formula with an even $n$, be
it an open formula or a closed formula, can be obtained in exactly the same manner
as the Simpson’s rule error term was obtained. In particular, for the midpoint rule,
it can be shown that

\[ \int_a^b f(x)\,dx = (b - a) f\!\left( \frac{a + b}{2} \right) + \frac{(b - a)^3}{24} f''(\xi), \]

where $a < \xi < b$. The midpoint rule therefore has degree of precision equal to 1.
The details of this derivation are left as an exercise.

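With the exception of the trapezoidal rule, the derivation of the error term for
a Newton-Cotes formula with $n$ odd is a bit more involved. As an example, consider
the open Newton-Cotes formula with $n = 1$. Let $\Delta x = (b - a)/3$, $x_0 = a + \Delta x$ and
$x_1 = a + 2\Delta x = b - \Delta x$. The error in $I_{1,open}(f)$ is then given by

\[ I(f) - I_{1,open}(f) = \int_a^b f[x_0, x_1, x](x - x_0)(x - x_1)\,dx. \]

As a first step in manipulating the error term, split the integration interval
at $x = b - \Delta x$; that is, write the error term as

\[ \int_a^{b - \Delta x} f[x_0, x_1, x](x - x_0)(x - x_1)\,dx
   + \int_{b - \Delta x}^{b} f[x_0, x_1, x](x - x_0)(x - x_1)\,dx. \qquad (1) \]

In the second integral, $(x - x_0)(x - x_1) \ge 0$ for all $x \in [b - \Delta x, b]$. Applying the
Weighted Mean-Value Theorem for Integrals leads to

\[ \int_{b - \Delta x}^{b} f[x_0, x_1, x](x - x_0)(x - x_1)\,dx = \frac{5}{324} (b - a)^3 f''(\xi_1), \]

where $a < \xi_1 < b$.
Next, in the first integral in (1), replace the product $f[x_0, x_1, x](x - x_1)$ by
$f[x_0, x] - f[x_0, x_1]$, which follows from the definition of divided differences. A
straightforward calculation gives

\[ \int_a^{b - \Delta x} f[x_0, x_1](x - x_0)\,dx = 0. \]

For the remaining integral, an integration by parts (using the antiderivative
$\frac{1}{2}(x - a)(x - x_1)$ of $x - x_0$, which vanishes at both endpoints) and application of the
Weighted Mean-Value Theorem for Integrals yields

\[ \int_a^{b - \Delta x} f[x_0, x](x - x_0)\,dx
   = -\frac{1}{2} \int_a^{b - \Delta x} f[x_0, x, x](x - a)(x - x_1)\,dx \]
\[ \phantom{\int_a^{b - \Delta x}} = -\frac{1}{2} f[x_0, \hat{\xi}_2, \hat{\xi}_2] \int_a^{b - \Delta x} (x - a)(x - x_1)\,dx \]
\[ \phantom{\int_a^{b - \Delta x}} = \frac{2}{81} (b - a)^3 f[x_0, \hat{\xi}_2, \hat{\xi}_2]
   = \frac{1}{81} (b - a)^3 f''(\xi_2). \]

Bringing all of these pieces together, we find

\[ I(f) - I_{1,open}(f) = (b - a)^3 \left[ \frac{5}{324} f''(\xi_1) + \frac{1}{81} f''(\xi_2) \right]. \]

Assuming that $f''$ is continuous, it can be shown (see Exercise 17) that $\xi_1$ and $\xi_2$
can be replaced by a common value $\xi$. Hence,

\[ I(f) - I_{1,open}(f) = \frac{(b - a)^3}{36} f''(\xi). \]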

Note that the open Newton-Cotes formula with n = 1 has degree of precision equal 
to 1. This is the same degree of precision as the midpoint rule, but the open 
Newton-Cotes formula with n = 1 requires more work than the midpoint rule (two 
function evaluations versus one). 
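
As a quick sanity check of this error term (an illustrative calculation, not part of the derivation), take $f(x) = x^2$ on $[0, 1]$, for which $f'' \equiv 2$. The check assumes the open rule with $n = 1$ has the form $\frac{b-a}{2}[f(x_0) + f(x_1)]$, which is what integrating the linear interpolant through $x_0$ and $x_1$ gives.

```latex
\begin{align*}
I(f) &= \int_0^1 x^2\,dx = \frac{1}{3}, \\
I_{1,\text{open}}(f) &= \frac{1}{2}\left[ \left(\tfrac{1}{3}\right)^2 + \left(\tfrac{2}{3}\right)^2 \right] = \frac{5}{18}, \\
I(f) - I_{1,\text{open}}(f) &= \frac{1}{3} - \frac{5}{18} = \frac{1}{18} = \frac{(1-0)^3}{36} \cdot 2 .
\end{align*}
```

The difference matches $\frac{(b-a)^3}{36} f''(\xi)$ exactly, as it must when $f''$ is constant.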

Using arguments similar to those given above, we can establish the following 
general theorem for the error associated with Newton-Cotes formulas. 


Theorem. Let $I_n(f)$ denote the Newton-Cotes quadrature rule (open or
closed) with $n + 1$ abscissas.

(a) If $n$ is even and $f$ has $n + 2$ continuous derivatives, then there exists a
constant $c$ and a $\xi \in [a, b]$ such that


$$I(f) = I_n(f) - c\,(b-a)^{n+3} f^{(n+2)}(\xi).$$


The degree of precision of $I_n(f)$ is $n + 1$.
(b) If $n$ is odd and $f$ has $n + 1$ continuous derivatives, then there exists a
constant $c'$ and a $\xi' \in [a, b]$ such that


$$I(f) = I_n(f) - c'\,(b-a)^{n+2} f^{(n+1)}(\xi').$$

The degree of precision of $I_n(f)$ is $n$.
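
The degree-of-precision claims can be checked by brute force. The sketch below is only an illustration: the helper name `degree_of_precision` and the choice of the interval $[0, 1]$ are assumptions, and the weights are the standard ones for the midpoint, open $n = 1$, trapezoidal, and Simpson rules. It uses exact rational arithmetic so that "exact" really means exact.

```python
from fractions import Fraction

def degree_of_precision(nodes, weights, a=Fraction(0), b=Fraction(1), max_degree=12):
    """Largest d such that the rule integrates 1, x, ..., x^d exactly on [a, b]."""
    d = -1
    for k in range(max_degree + 1):
        exact = (b ** (k + 1) - a ** (k + 1)) / (k + 1)      # integral of x^k
        approx = sum(w * x ** k for x, w in zip(nodes, weights))
        if approx != exact:
            break
        d = k
    return d

# Nodes and weights on [0, 1] (weights already include the factor b - a = 1).
rules = {
    "midpoint (open, n=0)": ([Fraction(1, 2)], [Fraction(1)]),
    "open, n=1": ([Fraction(1, 3), Fraction(2, 3)], [Fraction(1, 2), Fraction(1, 2)]),
    "trapezoidal (closed, n=1)": ([Fraction(0), Fraction(1)], [Fraction(1, 2), Fraction(1, 2)]),
    "Simpson (closed, n=2)": ([Fraction(0), Fraction(1, 2), Fraction(1)],
                              [Fraction(1, 6), Fraction(2, 3), Fraction(1, 6)]),
}

for name, (nodes, weights) in rules.items():
    print(name, degree_of_precision(nodes, weights))
```

Rules with even $n$ (midpoint, Simpson) should report degree $n + 1$, and rules with odd $n$ should report degree $n$.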


Remarks. (1) Note that the formulas with $n = 4$ and $n = 5$ both have degree
of precision equal to 5. Therefore, formulas with an even $n$ are generally
better: they provide similar accuracy with fewer function evaluations.

(2) The constants $c$ and $c'$ depend on both $n$ and the type of formula
(open vs. closed). For given $n$, the constant for the closed formula is typically
smaller than the constant for the open formula, so closed formulas are
used most often in practice. Because open formulas do not evaluate the integrand
at the endpoints of the integration interval, open formulas can be
useful for certain problems with endpoint singularities. Open formulas also
find use in the construction of numerical methods for the solution of initial
value problems.


EXERCISES


1. Approximate the value of each of the following integrals using the trapezoidal
rule. Verify that the theoretical error bound holds in each case.


(a) f? lax 
(b) foe *de 
(c) fo reer de 


(d) i tan ta dz. 


2. Repeat Exercise 1 using Simpson’s rule rather than the trapezoidal rule.

3. Repeat Exercise 1 using the midpoint rule rather than the trapezoidal rule.

4. Verify directly that the midpoint rule has degree of precision equal to 1.

5. Verify directly that the open Newton-Cotes formula with $n = 1$ has degree of
precision equal to 1.


6. (a) Determine values for the coefficients $A_0$, $A_1$, and $A_2$ so that the quadrature
formula


If) = [. fede Saeed (-3) + Ai flO) + Arf G) 


has degree of precision at least 2.
(b) Once the values of $A_0$, $A_1$, and $A_2$ have been computed, determine the
overall degree of precision for the quadrature rule.


7. (a) Determine values for the coefficients $A_0$, $A_1$, and $A_2$ so that the quadrature
formula


If)= iE flajdx = Aof (-3) + Aif (5) + A2f(1) 


has degree of precision at least 2.
(b) Once the values of $A_0$, $A_1$, and $A_2$ have been computed, determine the
overall degree of precision for the quadrature rule.


8. (a) Determine values for the coefficients $A_0$, $A_1$, and $x_1$ so that the quadrature
formula
J 
if) = i f(«)de = Ap f(-1) + Arf(a1) 
= 


has degree of precision at least 2.
(b) Once the values of $A_0$, $A_1$, and $x_1$ have been computed, determine the
overall degree of precision for the quadrature rule.


9. Consider the quadrature rule


[,roens(3)+s(3), 


Determine the degree of precision of this formula.

10. Consider the quadrature rule


1 
[ ferarm Bs (-\2) TOES] (V3) 3 


Determine the degree of precision of this formula.

11. Derive the error term for the midpoint rule:


$$\frac{(b-a)^3}{24} f''(\xi),$$


where $a < \xi < b$.


12. (a) Derive the closed Newton-Cotes formula with $n = 3$:


$$I(f) \approx I_{3,\text{closed}}(f) = \frac{b-a}{8}\left[ f(a) + 3f(a + \Delta x) + 3f(a + 2\Delta x) + f(b) \right].$$


(b) Verify that this formula has degree of precision equal to 3.
(c) Derive the error term associated with this quadrature rule.

13. (a) Derive the closed Newton-Cotes formula with $n = 4$:


$$I(f) \approx I_{4,\text{closed}}(f) = \frac{b-a}{90}\left[ 7f(a) + 32f(a + \Delta x) + 12f(a + 2\Delta x) + 32f(a + 3\Delta x) + 7f(b) \right].$$


(b) Verify that this formula has degree of precision equal to 5.
(c) Derive the error term associated with this quadrature rule.

14. (a) Derive the open Newton-Cotes formula with $n = 2$:


$$I(f) \approx I_{2,\text{open}}(f) = \frac{b-a}{3}\left[ 2f(a + \Delta x) - f(a + 2\Delta x) + 2f(a + 3\Delta x) \right].$$


(b) Verify that this formula has degree of precision equal to 3.
(c) Derive the error term associated with this quadrature rule.

15. (a) Derive the open Newton-Cotes formula with $n = 3$:


$$I(f) \approx I_{3,\text{open}}(f) = \frac{b-a}{24}\left[ 11f(a + \Delta x) + f(a + 2\Delta x) + f(a + 3\Delta x) + 11f(a + 4\Delta x) \right].$$


(b) Verify that this formula has degree of precision equal to 3.
(c) Derive the error term associated with this quadrature rule.
16. Prove the weighted mean-value theorem for integrals when $g(x) \le 0$ for all
$x \in [a, b]$.
17. (a) Let $g$ be a continuous function on $[a, b]$ and let $a_1, a_2, a_3, \ldots, a_n$ be any
set of nonnegative numbers such that


$$\sum_{i=1}^{n} a_i = A.$$


Show that for any set of points $x_1, x_2, x_3, \ldots, x_n \in [a, b]$, there exists a
$\xi \in [a, b]$ such that


$$\sum_{i=1}^{n} a_i\, g(x_i) = A\, g(\xi).$$


(b) Use the result of part (a) to show that, provided $f''$ is continuous, there
exists a $\xi \in [a, b]$ such that


$$\frac{5}{324} f''(\xi_1) + \frac{1}{81} f''(\xi_2) = \frac{1}{36} f''(\xi).$$


6.5 COMPOSITE NEWTON-COTES QUADRATURE 


Let $I_n(f)$ denote the (open or closed) Newton-Cotes quadrature rule with $n + 1$
abscissas. Suppose one computes


$$I_1(f),\ I_2(f),\ I_3(f),\ \ldots,\ I_n(f),$$


adding more and more points to obtain greater accuracy and, hopefully, convergence 
toward the true value of the integral. 

Based on our studies of interpolation, we know that there can be some difficul- 
ties with this approach. First, since we are dealing with polynomial interpolation at 
equally spaced points, the sequence of approximations may not converge. Second, 
for large values of n, some of the quadrature weights may become negative and 
introduce sensitivity to roundoff errors. 

An alternative approach for improving the accuracy of an approximation is 
to subdivide the integration interval [a,b] into pieces and then apply a low-order 
Newton-Cotes formula on each subinterval. Numerical integration performed in 
this manner is referred to as Composite Newton-Cotes quadrature. 


Composite Trapezoidal Rule 
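Recall that


$$I(f) = I_{1,\text{closed}}(f) + \text{error} = \frac{b-a}{2}\left[ f(a) + f(b) \right] - \frac{(b-a)^3}{12} f''(\xi).$$


If the integration interval $[a, b]$ is split into $n$ subintervals by defining $h = (b-a)/n$
and $x_j = a + jh$, $0 \le j \le n$, and then the trapezoidal rule formula is applied on
each subinterval $[x_{j-1}, x_j]$, we obtain


$$\begin{aligned}
I(f) &= \sum_{j=1}^{n} \int_{x_{j-1}}^{x_j} f(x)\,dx \\
&= \sum_{j=1}^{n} \frac{h}{2}\left[ f(x_{j-1}) + f(x_j) \right] - \sum_{j=1}^{n} \frac{h^3}{12} f''(\xi_j) \\
&= \underbrace{\frac{h}{2}\left[ f(x_0) + 2\sum_{j=1}^{n-1} f(x_j) + f(x_n) \right]}_{\text{composite trapezoidal rule}} \; \underbrace{-\, \frac{h^3}{12} \sum_{j=1}^{n} f''(\xi_j)}_{\text{error}},
\end{aligned}$$


where, for each $j$, $x_{j-1} < \xi_j < x_j$.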

The error term needs to be examined more closely. Suppose $f$ has two continuous
derivatives. Then the Extreme Value Theorem guarantees that there exist
two constants $c_1, c_2 \in [a, b]$ such that


$$f''(c_1) = \max_{x \in [a,b]} f''(x) \quad\text{and}\quad f''(c_2) = \min_{x \in [a,b]} f''(x).$$

It then follows that for each $j$

$$f''(c_2) \le f''(\xi_j) \le f''(c_1).$$

Summing over each subinterval $[x_{j-1}, x_j]$, we find that

$$n f''(c_2) \le \sum_{j=1}^{n} f''(\xi_j) \le n f''(c_1)$$


or


$$f''(c_2) \le \frac{1}{n} \sum_{j=1}^{n} f''(\xi_j) \le f''(c_1).$$


We can now conclude, by the Intermediate Value Theorem, that there exists $\xi \in [a, b]$
such that $f''(\xi) = \frac{1}{n} \sum_{j=1}^{n} f''(\xi_j)$. This implies that the error for the composite
trapezoidal rule can be written as


$$-\frac{h^3}{12} \sum_{j=1}^{n} f''(\xi_j) = -\frac{h^3 n}{12} f''(\xi) = -\frac{(b-a)h^2}{12} f''(\xi),$$


where we have used the fact that $nh = b - a$. Hence


$$\int_a^b f(x)\,dx = \frac{h}{2}\left[ f(x_0) + 2\sum_{j=1}^{n-1} f(x_j) + f(x_n) \right] - \frac{(b-a)h^2}{12} f''(\xi).$$

Note that provided the integrand has two continuous derivatives, the composite
trapezoidal rule has rate of convergence $O(h^2)$.


EXAMPLE 6.9 Numerical Verification of Rate of Convergence 


Consider the integral

$$I(f) = \int_0^{\pi} \sin x\,dx,$$


whose exact value is 2. The table below lists $T_h(f)$, the composite trapezoidal rule
approximation to $I(f)$ computed using a subinterval size of $h$, for several values
of $h$. Note that with each new row, the number of subintervals, $n$, is doubled and
the subinterval size is reduced by a factor of two. Further note that the error ratio
in the last column is approaching


$$4 = \left( \frac{2h}{h} \right)^2.$$

This is exactly what we would expect for a sequence converging with rate of convergence
$O(h^2)$.


  n     h        T_h(f)       e_h = |I(f) - T_h(f)|    e_{2h}/e_h

  1     π        0.0000000    2.0000000
  2     π/2      1.5707963    0.4292036                4.659792
  4     π/4      1.8961188    0.1038811                4.131681
  8     π/8      1.9742316    0.0257683                4.031337
 16     π/16     1.9935703    0.0064296                4.007741
 32     π/32     1.9983933    0.0016066                4.001929
 64     π/64     1.9995983    0.0004016                4.000482
128     π/128    1.9998996    0.0001004                4.000120
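
The table in this example can be regenerated with a few lines of code. The following is a minimal sketch, not the author's implementation; the function names `composite_trapezoid` and `error_table` are illustrative choices.

```python
import math

def composite_trapezoid(f, a, b, n):
    """Composite trapezoidal rule on [a, b] with n equal subintervals."""
    h = (b - a) / n
    interior = sum(f(a + j * h) for j in range(1, n))
    return (h / 2) * (f(a) + 2 * interior + f(b))

def error_table(f, a, b, exact, levels=8):
    """Rows (n, T_h, e_h, e_2h / e_h) for n = 1, 2, 4, ..., 2**(levels - 1)."""
    rows, previous_error = [], None
    for k in range(levels):
        n = 2 ** k
        approx = composite_trapezoid(f, a, b, n)
        error = abs(exact - approx)
        ratio = None if previous_error is None else previous_error / error
        rows.append((n, approx, error, ratio))
        previous_error = error
    return rows

rows = error_table(math.sin, 0.0, math.pi, exact=2.0)
for n, approx, error, ratio in rows:
    ratio_text = "" if ratio is None else f"{ratio:.6f}"
    print(f"{n:4d}  {approx:.7f}  {error:.7f}  {ratio_text}")
```

Each halving of $h$ should reduce the error by a factor close to 4, in line with the $O(h^2)$ rate.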


Composite Simpson’s Rule 


Since the basic Simpson’s rule formula already divides the interval $[a, b]$ into two
pieces, $[a, b]$ must be divided into an even number of subintervals to apply Simpson’s
rule in a composite manner. Therefore, let $n = 2m$, define


$$h = \frac{b-a}{n} = \frac{b-a}{2m}, \qquad x_j = a + jh \quad (0 \le j \le 2m),$$


and apply the Simpson’s rule formula $m$ times, once over each $[x_{2j-2}, x_{2j}]$ for
$j = 1, 2, 3, \ldots, m$. This produces


$$\begin{aligned}
I(f) &= \sum_{j=1}^{m} \frac{h}{3}\left[ f(x_{2j-2}) + 4f(x_{2j-1}) + f(x_{2j}) \right] - \sum_{j=1}^{m} \frac{h^5}{90} f^{(4)}(\xi_j) \\
&= \frac{h}{3}\left[ f(x_0) + 4\sum_{j=1}^{m} f(x_{2j-1}) + 2\sum_{j=1}^{m-1} f(x_{2j}) + f(x_{2m}) \right] - \frac{h^5}{90} \sum_{j=1}^{m} f^{(4)}(\xi_j).
\end{aligned}$$


Provided $f$ has four continuous derivatives, an analysis similar to that applied to
the error term for the composite trapezoidal rule can be used to show that there
exists a $\xi \in [a, b]$ such that $f^{(4)}(\xi) = \frac{1}{m} \sum_{j=1}^{m} f^{(4)}(\xi_j)$. The error term can therefore
be written as


$$-\frac{h^5}{90} \sum_{j=1}^{m} f^{(4)}(\xi_j) = -\frac{h^5 m}{90} f^{(4)}(\xi) = -\frac{(b-a)h^4}{180} f^{(4)}(\xi),$$


where we have used the fact that $hm = (b-a)/2$. The end result is the composite
Simpson’s rule:


$$I(f) = \frac{h}{3}\left[ f(x_0) + 4\sum_{j=1}^{m} f(x_{2j-1}) + 2\sum_{j=1}^{m-1} f(x_{2j}) + f(x_{2m}) \right] - \frac{(b-a)h^4}{180} f^{(4)}(\xi).$$


Thus, provided the integrand has four continuous derivatives, the composite Simpson’s
rule has rate of convergence $O(h^4)$.
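
The composite Simpson’s rule translates almost directly into code. The sketch below is one possible rendering (the name `composite_simpson` is an assumption, not something defined in the text); it splits the interior points into odd-indexed and even-indexed sums exactly as in the formula.

```python
import math

def composite_simpson(f, a, b, n):
    """Composite Simpson's rule on [a, b]; n must be a positive even integer."""
    if n <= 0 or n % 2 != 0:
        raise ValueError("n must be a positive even integer")
    h = (b - a) / n
    odd = sum(f(a + j * h) for j in range(1, n, 2))    # x_1, x_3, ..., x_{n-1}
    even = sum(f(a + j * h) for j in range(2, n, 2))   # x_2, x_4, ..., x_{n-2}
    return (h / 3) * (f(a) + 4 * odd + 2 * even + f(b))

# Errors for the integral of sin x over [0, pi], whose exact value is 2.
errors = [abs(2.0 - composite_simpson(math.sin, 0.0, math.pi, n)) for n in (2, 4, 8, 16)]
ratios = [errors[i] / errors[i + 1] for i in range(len(errors) - 1)]
print(ratios)  # successive ratios should drift toward 16
```

Because the rule has degree of precision 3, it should also reproduce the integral of any cubic up to roundoff, which makes a convenient correctness check.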


EXAMPLE 6.10 Numerical Verification of Rate of Convergence 


Reconsider the integral


$$I(f) = \int_0^{\pi} \sin x\,dx,$$


whose exact value is 2. The table below lists $S_h(f)$, the composite Simpson’s rule
approximation to $I(f)$ computed using a subinterval size of $h$, for several values
of $h$. Note that with each new row, the number of subintervals, $n$, is doubled and
the subinterval size is reduced by a factor of two. Further note that the error ratio
in the last column is approaching

$$16 = \left( \frac{2h}{h} \right)^4.$$


This is exactly what we would expect for a sequence converging with rate of convergence
$O(h^4)$.

  n     h        S_h(f)           e_h = |I(f) - S_h(f)|    e_{2h}/e_h

  2     π/2      2.09439510239    0.09439510239
  4     π/4      2.00455975498    0.00455975498            20.701792
  8     π/8      2.00026916995    0.00026916995            16.940059
 16     π/16     2.00001659105    0.00001659105            16.223806
 32     π/32     2.00000103337    0.00000103337            16.055292
 64     π/64     2.00000006453    0.00000006453            16.013782
128     π/128    2.00000000403    0.00000000403            16.003442


A Comment Regarding Numerical Verification of Rates of Convergence 


We have just verified numerically that the composite trapezoidal rule
has rate of convergence $O(h^2)$ and that the composite Simpson’s rule has rate of
convergence $O(h^4)$. We did this by selecting a definite integral for which the exact
value was known, computing a sequence of approximations and checking that the
approximation error was reduced by an appropriate factor. Do we always have
to work with a problem for which the exact solution is known? Fortunately, the
answer is no. We can still numerically verify a rate of convergence when the exact
solution is not known; we just have to examine a different ratio.

To illustrate the process, suppose we are approximating the value of some
definite integral, $I(f)$, using the composite trapezoidal rule. Let $T_h(f)$, $T_{h/2}(f)$
and $T_{h/4}(f)$ denote the composite trapezoidal rule approximations obtained using
subinterval sizes of $h$, $h/2$ and $h/4$, respectively. Consider the ratio


$$\frac{T_h(f) - T_{h/2}(f)}{T_{h/2}(f) - T_{h/4}(f)}.$$


Let $e_h = T_h(f) - I(f)$; that is, $e_h$ is the error associated with $T_h(f)$. Since the
composite trapezoidal rule theoretically has rate of convergence $O(h^2)$, we should
expect to find $e_h \approx 4e_{h/2}$ for sufficiently small $h$. Consequently, we should find

$$\begin{aligned}
\frac{T_h(f) - T_{h/2}(f)}{T_{h/2}(f) - T_{h/4}(f)}
&= \frac{(T_h(f) - I(f)) - (T_{h/2}(f) - I(f))}{(T_{h/2}(f) - I(f)) - (T_{h/4}(f) - I(f))} \\
&= \frac{e_h - e_{h/2}}{e_{h/2} - e_{h/4}} \\
&\approx \frac{4e_{h/2} - e_{h/2}}{e_{h/2} - \frac{1}{4}e_{h/2}} = 4
\end{aligned}$$


for sufficiently small $h$.


EXAMPLE 6.11 Numerical Verification of Rate of Convergence 
Consider the definite integral


$$I(f) = \int_0^1 \sqrt{1 + x^3}\,dx.$$

The table below lists composite trapezoidal rule approximations to $I(f)$ for several
values of $h$. Observe that the ratio


$$\frac{T_h(f) - T_{h/2}(f)}{T_{h/2}(f) - T_{h/4}(f)}$$


approaches 4 as $h$ is decreased, thereby providing numerical verification that the
rate of convergence is $O(h^2)$.


  n     h        T_h(f)            (T_h - T_{h/2}) / (T_{h/2} - T_{h/4})

  1     1        1.207106781186    4.335258
  2     1/2      1.133883476483    4.057269
  4     1/4      1.116993293318    4.014294
  8     1/8      1.112830349496    4.003560
 16     1/16     1.111793319381    4.000889
 32     1/32     1.111534292393    4.000222
 64     1/64     1.111469550038
128     1/128    1.111453365349
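
The ratio test of this example needs nothing beyond three successive approximations. Here is a small sketch under the assumption that a `composite_trapezoid` helper like the one below is available (the name is illustrative); no exact value of the integral is used anywhere.

```python
import math

def composite_trapezoid(f, a, b, n):
    """Composite trapezoidal rule on [a, b] with n equal subintervals."""
    h = (b - a) / n
    interior = sum(f(a + j * h) for j in range(1, n))
    return (h / 2) * (f(a) + 2 * interior + f(b))

def integrand(x):
    return math.sqrt(1.0 + x ** 3)

# T[k] uses n = 2**k subintervals, i.e. h = 1, 1/2, 1/4, ..., 1/128.
T = [composite_trapezoid(integrand, 0.0, 1.0, 2 ** k) for k in range(8)]

# (T_h - T_{h/2}) / (T_{h/2} - T_{h/4}) for each h with two finer levels available.
ratios = [(T[k] - T[k + 1]) / (T[k + 1] - T[k + 2]) for k in range(len(T) - 2)]
for k, r in enumerate(ratios):
    print(f"n = {2 ** k:3d}   ratio = {r:.6f}")
```

For a method of order $p$ the same ratio tends to $2^p$, so $\log_2$ of the printed values gives a direct estimate of the observed order.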


A similar process can be used to verify numerically the rate of convergence
for any composite quadrature rule. In particular, suppose that $Q_h(f)$ is an approximation
to the definite integral $I(f)$ obtained using a generic composite quadrature
formula with a subinterval size of $h$. If the composite quadrature formula has a
theoretical rate of convergence of $O(h^k)$, then it can be shown (see Exercise 3) that
the ratio


$$\frac{Q_h(f) - Q_{h/2}(f)}{Q_{h/2}(f) - Q_{h/4}(f)}$$


should approach $2^k$ as $h$ is decreased toward zero.


Using the Error Term 

The next two examples demonstrate the use of the error term associated with a 
composite quadrature formula for determining the number of subintervals needed 
to achieve a given level of accuracy. In Sections 6.7 and 6.8, we will see how to use 
the error terms to construct algorithms which will control the approximation error 
automatically. 


EXAMPLE 6.12 Approximating the Value of $\pi$
Since


$$\int_0^1 \frac{1}{1+x^2}\,dx = \tan^{-1} x \,\Big|_0^1 = \tan^{-1} 1 = \frac{\pi}{4},$$


it follows that the value of $\pi$ can be approximated by taking four times an approximation
for the integral on the left of the above expression. Suppose we want to
approximate $\pi$ to four decimal places. This means that the absolute error must
be less than $5.0 \times 10^{-5}$, which in turn implies that the error in approximating the
integral must be less than $1.25 \times 10^{-5}$.

TRAPEZOIDAL RULE 

Since $h = (b-a)/n$, the error term associated with the trapezoidal rule can be
written in the form


$$-\frac{(b-a)h^2}{12} f''(\xi) = -\frac{(b-a)^3}{12n^2} f''(\xi).$$

For $f(x) = 1/(1 + x^2)$, it can be shown that

$$\max_{x \in [0,1]} |f''(x)| = 2,$$


so the value of $n$ must be selected to satisfy the inequality

$$\frac{(1-0)^3}{12n^2} \cdot 2 < 1.25 \times 10^{-5}$$


in order to guarantee an error of no more than $1.25 \times 10^{-5}$. The solution of this
inequality is $n > 115.47$; therefore, we use $n = 116$. With $n = 116$,


$$\int_0^1 \frac{1}{1+x^2}\,dx \approx 0.78539506688536 \quad\Longrightarrow\quad \pi \approx 3.14158026754144.$$


The absolute error in this approximation is $1.2386 \times 10^{-5}$, well within the required
accuracy.

SIMPSON’S RULE 

For Simpson’s rule, $h = (b-a)/n$, so the error term can be written as


$$-\frac{(b-a)h^4}{180} f^{(4)}(\xi) = -\frac{(b-a)^5}{180n^4} f^{(4)}(\xi).$$


It can be shown that $\max_{x \in [0,1]} |f^{(4)}(x)| = 24$, so that $n$ must be selected to satisfy
the inequality

$$\frac{(1-0)^5}{180n^4} \cdot 24 < 1.25 \times 10^{-5}.$$

The solution of this inequality is $n > 10.16$. Simpson’s rule requires an even number
of subintervals, so 12 is the smallest number of subintervals that will guarantee an
error of no more than $1.25 \times 10^{-5}$. With $n = 12$,


$$\int_0^1 \frac{1}{1+x^2}\,dx \approx 0.78539816007634 \quad\Longrightarrow\quad \pi \approx 3.14159264030538.$$

The absolute error in this approximation is $1.3284 \times 10^{-8}$, so we actually achieved
seven decimal places of accuracy.

The power of a fourth-order method over a second-order method is quite ap- 
parent here. Simpson’s Rule produced a more accurate result than the trapezoidal 
rule using roughly one-tenth the number of function evaluations. 
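
The calculation in Example 6.12 can be scripted. The sketch below is illustrative only: the helper names (`min_n_trapezoid`, `min_n_simpson`, and the two rule functions) are assumptions, and the derivative bounds 2 and 24 are the ones quoted in the example rather than computed by the code.

```python
import math

def composite_trapezoid(f, a, b, n):
    h = (b - a) / n
    return (h / 2) * (f(a) + 2 * sum(f(a + j * h) for j in range(1, n)) + f(b))

def composite_simpson(f, a, b, n):
    h = (b - a) / n
    odd = sum(f(a + j * h) for j in range(1, n, 2))
    even = sum(f(a + j * h) for j in range(2, n, 2))
    return (h / 3) * (f(a) + 4 * odd + 2 * even + f(b))

def min_n_trapezoid(a, b, max_abs_f2, tol):
    """Smallest n with (b-a)^3 * max|f''| / (12 n^2) < tol."""
    return math.floor(math.sqrt((b - a) ** 3 * max_abs_f2 / (12 * tol))) + 1

def min_n_simpson(a, b, max_abs_f4, tol):
    """Smallest even n with (b-a)^5 * max|f''''| / (180 n^4) < tol."""
    n = math.floor(((b - a) ** 5 * max_abs_f4 / (180 * tol)) ** 0.25) + 1
    return n if n % 2 == 0 else n + 1

def f(x):
    return 1.0 / (1.0 + x * x)

tol = 1.25e-5
n_trap = min_n_trapezoid(0.0, 1.0, 2.0, tol)    # bound on |f''| from the example
n_simp = min_n_simpson(0.0, 1.0, 24.0, tol)     # bound on |f''''| from the example
pi_trap = 4 * composite_trapezoid(f, 0.0, 1.0, n_trap)
pi_simp = 4 * composite_simpson(f, 0.0, 1.0, n_simp)
print(n_trap, pi_trap, abs(math.pi - pi_trap))
print(n_simp, pi_simp, abs(math.pi - pi_simp))
```

The same two helper functions apply to any integrand once bounds on the second and fourth derivatives are known.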


474 Chapter 6 Differentiation and Integration 


EXAMPLE 6.13 The Cost of a Warranty 


The integral


$$\int_0^1 e^{-x^4}\,dx$$


arises in the determination of the cost of a treadwear warranty on tires [Kevin
Hastings, “Reliability and the Cost of Guarantees,” in Applications of Calculus
(Resources for Calculus, volume 3), Philip Straffin, editor, MAA Notes #29, The
Mathematical Association of America, 1993, pp. 152-166]. Suppose the value of
this integral is needed with a guaranteed error of no more than 0.00001.

For the composite trapezoidal rule, the number of subintervals must be selected
to satisfy the inequality


$$\frac{(1-0)^3}{12 n_{\text{trap}}^2} \max_{0 \le \xi \le 1} |f''(\xi)| < 0.00001;$$

for the composite Simpson’s rule, the number of subintervals must satisfy the inequality

$$\frac{(1-0)^5}{180 n_{\text{simp}}^4} \max_{0 \le \xi \le 1} |f^{(4)}(\xi)| < 0.00001.$$

It can be shown that $\max_{0 \le \xi \le 1} |f''(\xi)| \le 3.5$ and $\max_{0 \le \xi \le 1} |f^{(4)}(\xi)| \le 95$, leading
to the values $n_{\text{trap}} = 171$ and $n_{\text{simp}} = 16$. Using these values for $n$ produces the
estimates:


Trapezoidal rule: $\int_0^1 e^{-x^4}\,dx \approx 0.8448344011$ (error $= 4.194 \times 10^{-6}$)
Simpson’s rule: $\int_0^1 e^{-x^4}\,dx \approx 0.8448403780$ (error $= 1.783 \times 10^{-6}$)


In computing the absolute error, an “exact” value for the integral was determined 
by using Simpson’s rule with n = 3200. The value obtained in this manner was 
0.8448385947. 


Application Problem: Bags of Pine Bark Mulch 


In the Chapter 6 Overview (see page 429), we found that the number of bags of pine 
bark mulch, each bag with a capacity of 3 cubic feet, needed to cover the irregularly 
shaped plot of land shown in Figure 6.1 to a uniform depth of three inches is given 
by 


$$\frac{1}{12} \int_0^{26} f(x)\,dx.$$


The integrand, f, is defined in the following table. 
x      0     1     2     3     4     5     6     7     8
f(x)   8.0   10.0  11.0  15.0  16.0  16.0  16.0  16.0  15.5


x      9     10    11    12    13    14    15    16    17
f(x)   15.5  15.5  15.0  15.0  15.0  14.5  14.5  14.0  14.0


x      18    19    20    21    22    23    24    25    26
f(x)   14.0  13.5  13.0  13.0  13.0  12.0  11.0  9.0   6.0


Using the trapezoidal rule, we find 


$$\begin{aligned}
\int_0^{26} f(x)\,dx &\approx \frac{h}{2}\left[ f(0) + 2\sum_{j=1}^{25} f(j) + f(26) \right] \\
&= \frac{1}{2}\left[ 8 + 2(347) + 6 \right] \\
&= 354 \text{ square feet}.
\end{aligned}$$


Accordingly, we estimate that $\frac{1}{12} \cdot 354 = 29.5$ bags of mulch are needed. Since we
need at least this many bags and we must purchase an integer number of bags, we
round our estimate up to 30.

Because there are an even number of subintervals in the data set, we can also 
use Simpson’s rule to approximate the definite integral. We then find 


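$$\begin{aligned}
\int_0^{26} f(x)\,dx &\approx \frac{h}{3}\left[ f(0) + 4\sum_{j=1}^{13} f(2j-1) + 2\sum_{j=1}^{12} f(2j) + f(26) \right] \\
&= \frac{1}{3}\left[ 8 + 4(178.5) + 2(168.5) + 6 \right] \\
&= 355 \text{ square feet}.
\end{aligned}$$


Hence, we would estimate that $\frac{1}{12} \cdot 355 = 29.58$ bags of mulch are needed. Once
again, we round this estimate up to 30.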
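
Because the integrand here is known only through tabulated values, the composite rules are applied directly to the data. The sketch below repeats the two calculations; the variable names are illustrative, and the list `widths` holds the 27 tabulated values of $f$ at $x = 0, 1, \ldots, 26$.

```python
import math

# Tabulated values f(0), f(1), ..., f(26) from the table above.
widths = [8.0, 10.0, 11.0, 15.0, 16.0, 16.0, 16.0, 16.0, 15.5,
          15.5, 15.5, 15.0, 15.0, 15.0, 14.5, 14.5, 14.0, 14.0,
          14.0, 13.5, 13.0, 13.0, 13.0, 12.0, 11.0, 9.0, 6.0]
h = 1.0  # spacing between tabulated points

# Composite trapezoidal rule: endpoints once, interior points twice.
area_trap = h * (widths[0] + 2 * sum(widths[1:-1]) + widths[-1]) / 2

# Composite Simpson's rule: odd-indexed points weighted 4, even interior points 2.
odd_sum = sum(widths[1:-1:2])    # f(1), f(3), ..., f(25)
even_sum = sum(widths[2:-1:2])   # f(2), f(4), ..., f(24)
area_simp = h * (widths[0] + 4 * odd_sum + 2 * even_sum + widths[-1]) / 3

bags_trap = math.ceil(area_trap / 12)   # one bag covers 12 square feet at 3 inches deep
bags_simp = math.ceil(area_simp / 12)
print(area_trap, area_simp, bags_trap, bags_simp)
```

Both estimates round up to the same number of bags, so the purchasing decision does not depend on which rule is used.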


Periodic Integrands 


Although the composite trapezoidal rule is generally a second-order method (con- 
vergence toward the true value of the integral goes like the square of the mesh 
spacing), there is a class of integrals for which the composite trapezoidal rule is 
extremely accurate. Consider evaluating the integral 


$$\int_0^{\pi} \sqrt{1 + \cos^2 x}\,dx,$$


which computes the length of the sine curve over one half period. The following 
table shows the approximation to this integral computed using the composite trape- 
zoidal rule for several values of n. The error in the approximation is also shown. 
All calculations were performed in Maple using 110 digits of precision. 
  n    Approximation    Error

  2    3.7922377959     $2.79600 \times 10^{-2}$
  4    3.8199436432     $2.54146 \times 10^{-4}$
  8    3.8201977154     $7.36424 \times 10^{-8}$
 16    3.8201977890     $1.90461 \times 10^{-14}$
 32    3.8201977890     $3.74892 \times 10^{-27}$
 64    3.8201977890     $4.18779 \times 10^{-52}$
128    3.8201977890     $1.49209 \times 10^{-101}$


Since the value of n is doubled for each new approximation, which implies that 
the value of h was cut in half, the error with each new approximation should have 
dropped by a factor of four. The errors in this example definitely seem to be 
decreasing at a much more rapid pace. In fact, each new error appears to be almost 
the square of the previous error. 

To make the analysis of the error more quantitative, let $E_n$ denote the error
associated with the composite trapezoidal rule computed with $n$ subintervals.
Theory indicates that $E_n$ should be inversely proportional to the square of $n$; that
is,

$$E_n \approx \frac{c}{n^2}$$

for some constant $c$. Taking the logarithm of this expression yields

$$\log E_n \approx \log c - 2\log n.$$


Hence, if $E_n$ is inversely proportional to the square of $n$, a plot of $\log n$ versus
$\log E_n$ should be linear. Using the values in the above table, the relationship
between $\log n$ and $\log E_n$ is clearly not linear (see Figure 6.9). In fact, the rate of
decrease in $E_n$ appears to continually accelerate with increasing $n$.

The plot in Figure 6.9 suggests that the error in this example is exponential
in $n$, not algebraic. To examine this further, suppose


$$E_n \approx c\,b^n$$

for some constants $b$ and $c$. Again taking logarithms leads to

$$\log E_n \approx \log c + n\log b,$$


which implies that for an exponential relationship between $E_n$ and $n$, a plot of $n$
versus $\log E_n$ should be linear. Figure 6.10 shows $n$ versus $\log E_n$ for the current
example, verifying an exponential rate of convergence.

The explanation for this phenomenon lies in the Euler-Maclaurin sum formula.
Provided $f$ is sufficiently differentiable, it can be shown (see Davis and
Rabinowitz [1]) that


$$I(f) = T_h(f) - \frac{h^2}{12}\left[ f'(b) - f'(a) \right] + \frac{h^4}{720}\left[ f'''(b) - f'''(a) \right] - \cdots - \frac{B_{2k} h^{2k}}{(2k)!}\left[ f^{(2k-1)}(b) - f^{(2k-1)}(a) \right] + O(h^{2k+2}),$$


where $T_h(f)$ denotes the composite trapezoidal rule approximation to $I(f)$ and the
$B_{2k}$ are constants known as Bernoulli numbers. Hence, if the integrand has odd
derivatives that assume equal values at the endpoints of the integration interval, the
composite trapezoidal rule will be more accurate (and possibly much more accurate)
than second order. Integrands that are periodic with period equal to $b - a$ will clearly
satisfy these special endpoint conditions. If, in addition to being $(b - a)$-periodic,
the integrand is also infinitely differentiable, then the composite trapezoidal rule
will converge faster than any power of the step size; that is, convergence will be
exponentially rapid. It is straightforward in this case to show that the integrand,
$f(x) = \sqrt{1 + \cos^2 x}$, is $\pi$-periodic and infinitely differentiable.


Figure 6.9 Logarithm of the error versus logarithm of number of
subintervals for the composite trapezoidal rule approximation to the
length of the sine curve over one-half period.

Figure 6.10 Logarithm of the error versus number of subintervals for
the composite trapezoidal rule approximation to the length of the sine
curve over one-half period.
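
The rapid convergence for this periodic integrand is visible even in ordinary double precision, although only for small $n$, since the error reaches roundoff level by about $n = 16$. The sketch below is an illustration with assumed names; it uses the $n = 64$ trapezoidal value as a stand-in for the exact integral, which is itself an assumption justified by the table above.

```python
import math

def composite_trapezoid(f, a, b, n):
    h = (b - a) / n
    return (h / 2) * (f(a) + 2 * sum(f(a + j * h) for j in range(1, n)) + f(b))

def arc_length_integrand(x):
    return math.sqrt(1.0 + math.cos(x) ** 2)

# Assumption: with n = 64 the rule has converged to double-precision accuracy.
reference = composite_trapezoid(arc_length_integrand, 0.0, math.pi, 64)

errors = {n: abs(reference - composite_trapezoid(arc_length_integrand, 0.0, math.pi, n))
          for n in (2, 4, 8)}
for n, e in errors.items():
    print(f"n = {n}   error = {e:.5e}")
```

A second-order method would reduce the error by a factor of about 4 per doubling of $n$; here the reductions are roughly 100 and then several thousand.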


Sensitivity to Roundoff Error 


At the end of Section 6.2, the effect of roundoff error on numerical differentiation 
formulas was investigated. It was found that the roundoff error component of 
the total error grew without bound as the step size was reduced to zero. The 
obvious question to ask is whether numerical integration formulas exhibit the same 
sensitivity to roundoff error. 

Consider the composite trapezoidal rule and suppose that


$$f(x_j) = \tilde{f}(x_j) + e_j$$


for each $j$, $0 \le j \le n$, where


$f(x_j)$            known function value
$\tilde{f}(x_j)$    floating point representation for $f(x_j)$
$e_j$               roundoff error associated with $f(x_j)$.


Substituting for $f(x_j)$ in the composite trapezoidal rule, we obtain


$$I(f) = \tilde{T}_h(f) + \underbrace{\frac{h}{2}\left[ e_0 + 2\sum_{j=1}^{n-1} e_j + e_n \right]}_{\text{roundoff error}} - \frac{h^2}{12}(b-a) f''(\xi).$$

If $|e_j| \le \epsilon$ for all $j$, then


$$|\text{roundoff error}| \le \frac{h}{2}\left[ 1 + 2(n-1) + 1 \right]\epsilon = nh\epsilon = (b-a)\epsilon.$$


Hence, the amount of roundoff error introduced by the composite trapezoidal rule
is bounded independent of the step size.
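
The bound $(b-a)\epsilon$ can be illustrated with a small experiment. The sketch below is a simulation under stated assumptions: instead of true floating-point roundoff, each function value is perturbed by a random amount of magnitude at most $\epsilon = 10^{-6}$ (deliberately large so that the effect dominates genuine rounding), and the function and variable names are illustrative.

```python
import math
import random

def trapezoid_from_values(values, h):
    """Composite trapezoidal rule applied to tabulated values with spacing h."""
    return (h / 2) * (values[0] + 2 * sum(values[1:-1]) + values[-1])

def perturbation_effect(f, a, b, n, eps, rng):
    """Change in the trapezoidal sum when each f(x_j) is perturbed by at most eps."""
    h = (b - a) / n
    exact = [f(a + j * h) for j in range(n + 1)]
    perturbed = [v + rng.uniform(-eps, eps) for v in exact]
    return abs(trapezoid_from_values(perturbed, h) - trapezoid_from_values(exact, h))

rng = random.Random(0)   # fixed seed so the experiment is repeatable
eps = 1e-6
a, b = 0.0, math.pi
effects = {n: perturbation_effect(math.sin, a, b, n, eps, rng) for n in (10, 100, 1000, 10000)}
for n, effect in effects.items():
    print(f"n = {n:6d}   effect = {effect:.3e}   bound = {(b - a) * eps:.3e}")
```

The observed effect stays below $(b-a)\epsilon$ no matter how large $n$ becomes, in contrast to numerical differentiation, where the roundoff contribution grows as the step size shrinks.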
EXERCISES


1. Provide the details of the transformation of the error term associated with the
composite Simpson’s rule from


$$-\frac{h^5}{90} \sum_{j=1}^{m} f^{(4)}(\xi_j) \quad\text{to}\quad -\frac{(b-a)h^4}{180} f^{(4)}(\xi).$$


2. Derive the composite midpoint rule with error:


$$\int_a^b f(x)\,dx = 2h \sum_{j=1}^{n} f(x_j) + \frac{(b-a)h^2}{6} f''(\xi),$$


where $h = (b-a)/2n$, $x_j = a + (2j-1)h$ and $\xi \in [a, b]$.


3. (a) Let $Q_h(f)$ be an approximation to the definite integral $I(f)$ obtained using
a generic composite quadrature formula with a subinterval size of $h$. If
the composite quadrature formula has a theoretical rate of convergence of
$O(h^k)$, show that

$$\frac{Q_h(f) - Q_{h/2}(f)}{Q_{h/2}(f) - Q_{h/4}(f)} \approx 2^k.$$


(b) What value do we expect from the ratio


$$\frac{S_h(f) - S_{h/2}(f)}{S_{h/2}(f) - S_{h/4}(f)},$$


where $S_h(f)$ denotes the composite Simpson’s rule approximation to the
definite integral $I(f)$ obtained with a subinterval size of $h$?


4. Verify that the composite Simpson’s rule has rate of convergence $O(h^4)$ by approximating
the value of $\int_0^1 \sqrt{1 + x^3}\,dx$.


5. (a) Verify that the composite midpoint rule has rate of convergence $O(h^2)$ by
approximating the value of $\int_0^1 \sqrt{1 + x^3}\,dx$.
(b) Repeat part (a) by approximating the value of $\int_0^{\pi} \sin x\,dx$.


In Exercises 6-11, verify that the composite trapezoidal rule has rate of convergence
$O(h^2)$, the composite midpoint rule has rate of convergence $O(h^2)$, and the composite
Simpson’s rule has rate of convergence $O(h^4)$, by approximating the value of the
indicated definite integral.


6. 
7. 
8. 
9. 
10. 


fra da 
yi ee? dz 
ft tan! 2 dz 


2 sing 
Hh x da 


L 
to es da 
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11 
12 


13 
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: fo eva? +9 dx 


. Suppose that there exists a composite quadrature rule, Q(f), with the property 


b Ae 
[ tea=an- GE Mo, 
where a < € < bandh = (b—a)/n. 


(a) What is the rate of convergence associated with this quadrature rule? What 
conditions must the integrand satisfy to achieve this rate of convergence? 
Explain how you would numerically verify the rate of convergence. 


(b) What is the degree of precision of this quadrature rule? Explain how to 
verify the degree of precision. 


(c) What is the smallest value of n needed to guarantee an approximation to 
the value of i. 4 dz to within 1075? Justify your response. 
. (a) Determine the smallest value of n that guarantees that the composite mid- 
point rule approximates the value of a eee] dz to within 1.25 x 1075, 
(b) Determine the smallest value of n which guarantees that the composite 
midpoint rule approximates the value of fe e-® dx to within 10-°. 


In Exercises 14-20, approximate the value of the indicated definite integral using the
composite trapezoidal rule, the composite midpoint rule and the composite Simpson’s
rule. For each method, use the smallest value of $n$ that will guarantee an absolute error
not greater than $5 \times 10^{-5}$.


14 


15. 


16 


17. 


18 


19. 


20 


ai. 


24 
. 1 7 dz 
a e” dz 
0 
* de tan7! x dz 


2 sing 
fi =z dz 


14 
So Fram 
fo Vat +9 dx 
: Jo Vi+ 23 ax 


(a) Show that the error associated with the composite Simpson’s rule can be
approximated by

$$-\frac{h^4}{180}\left[ f'''(b) - f'''(a) \right].$$


[Hint: Recognize that $2h \sum_{j=1}^{m} f^{(4)}(\xi_j)$ is a Riemann sum for $\int_a^b f^{(4)}(x)\,dx$.]
(b) Show that the error associated with the composite midpoint rule can be
approximated by


$$\frac{h^2}{6}\left[ f'(b) - f'(a) \right].$$


22. 


23. 
24, 


25. 


26. 


27. 
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Consider the definite integral f. sin(,/mz) dz. Numerically determine the rate of 
convergence of the composite trapezoidal rule for each of the following integration 
intervals. 


(a) $[a, b] = [0, 1]$

(b) $[a, b] = [\pi/4, 9\pi/4]$

(c) $[a, b] = [\pi, 2\pi]$

(d) Explain any variation among the rates of convergence obtained in parts (a), 
(b), and (c). 

Repeat Exercise 22 for the composite midpoint rule. 

Consider the definite integral pee + 2° — 3a? — 4x — 1) de. 

(a) Numerically determine the rate of convergence of the composite trapezoidal 
rule when applied to the given integral. 

(b) Numerically determine the rate of convergence of the composite midpoint 
rule when applied to the given integral. 

(c) Provide an explanation for the results obtained in parts (a) and (b). 

With an optimal tilting strategy, the theoretical lower bound for the time needed 

to pour milk from a plastic pouch into a pitcher (N. Curle, “Liquid Flowing from 

a Container,” in Mathematical Modeling, Andrews and McLone, eds., Butter- 

worths, 1976, pp. 39-55) requires the calculation of the integrals 


0.8355 1 2 

: 2 

/ (1 + a7)i/4 dr and i aS dz. 
0.1763 o.g355 2°(1 + 2?) 


Approximate the value of each integral with an absolute error no greater than
10". 

Using Newton’s Second Law, it can be shown that the period, $T$ (the time for one
complete swing), of a pendulum with length $L$ and maximum angle of deflection
$\theta_0$ is given by
am /2 
‘ =a f= | a 
9 Jo 1 —k? sin? x 


where & = sin(@9) and g is the acceleration due to gravity. To calibrate the timing 
mechanism in their top-of-the-line model, a grandfather clock manufacturer needs 
to know the period of a pendulum with L = 1 meter and 6) = 12° to within 
107° seconds. Calculate the period to the required accuracy. 
Ammonia vapor is compressed inside a cylinder by an external force acting on 
the piston. The following data give the volume, v, measured in liters, and the 
pressure, p, measured in kilopascals. 

v   0.50   0.60   0.72   0.84   0.96   1.08   1.25

p   1400   1248   1100   945    802    653    500


The work for the process is given by the integral


$$\int_{0.5}^{1.25} p\,dv.$$


Estimate the work done in the following ways: 
(a) using the trapezoidal rule; 
(b) by passing a cubic spline through the data and then integrating the spline.
28. Values of the volume ($v$, measured in cubic inches) and the pressure ($p$, measured
in pounds per square inch) of a gas as it expands from a volume of 1 cubic inch
to a volume of 2.5 cubic inches are presented in the table below.

v   1.00   1.25   1.50   1.75   2.00   2.25   2.50

p   68.7   55.0   45.8   39.3   34.4   30.5   27.5


The work done by the gas as it expands is given by

$$W = \int_{1.00}^{2.50} p\,dv.$$
Estimate the value of this integral. 
29. Approximate the value of the integral 


$$\int_0^1 2x f(x)\,dx,$$


where $f$ is given by

x      0.0    0.1    0.2    0.3    0.4    0.5    0.6    0.7    0.8    0.9    1.0

f(x)   0.667  0.671  0.689  0.711  0.742  0.790  0.841  0.910  0.975  1.052  1.130
This integral arises in computing the mean flight distance of birds, randomly 

dispersed throughout a circular region, to all other points of the region (see J. F. 

Wittenberger and M. B. Dollinger, “The Effect of Acentric Colony Location on 

the Energetics of Avian Coloniality,” American Naturalist, 124, 189-204, 1984). 


6.6 GAUSSIAN QUADRATURE 


In the previous two sections, the concept of Newton-Cotes quadrature was developed.
In this approach to numerical integration, the value of the definite integral


$$I(f) = \int_a^b f(x)\,dx$$


is approximated by the quadrature rule

$$I_n(f) = \sum_{i=0}^{n} w_i f(x_i).$$


The $x_i$ are called abscissas and are selected as equally spaced points within the
integration interval $[a, b]$. The weights, $w_i$, are then found by integrating the polynomial
that interpolates the integrand at the abscissas. The degree of precision of
the resulting rule is $n$ when $n$ is odd and $n + 1$ when $n$ is even.

In this section, we will develop the concept of Gaussian quadrature. In this 
approach to numerical integration, the abscissas and weights are selected so as to 
achieve the highest possible degree of precision. 




Method of Undetermined Coefficients 


The method of undetermined coefficients is essentially the brute force method for
developing Gaussian quadrature rules. It involves a straightforward application
of the definition of degree of precision, and proceeds as follows. Given a positive
integer $n$, we wish to determine $2n$ numbers (the abscissas $x_1, x_2, x_3, \ldots, x_n$, and
the weights $w_1, w_2, w_3, \ldots, w_n$) so that the summation


$$w_1 f(x_1) + w_2 f(x_2) + w_3 f(x_3) + \cdots + w_n f(x_n)$$


provides the exact value of $\int_a^b f(x)\,dx$ for $f(x) = 1, x, x^2, \ldots, x^{2n-1}$. In other
words, the quadrature rule will have degree of precision equal to $2n - 1$. Applying
each of these conditions produces a system of $2n$ equations, which will be nonlinear
for all $n \ge 1$.


To demonstrate the method of undetermined coefficients process, let’s develop 
a Gaussian quadrature rule with n = 1. We want the approximation formula 


b 
I (Pease) 


to have degree of precision 1; that is, this formula should obtain exact results for 
all constant and for all linear functions. Applying these two conditions produces 
the following system of equations: 
1: w =f’ dx =b-~a 
f(a)=a: wit, = f° cde = 3(b? — a’), 


The solution of this system is $w_1 = b - a$ and $x_1 = (a + b)/2$, so the resulting
Gaussian quadrature rule is

$$\int_a^b f(x) \, dx \approx (b - a) f\left(\frac{a + b}{2}\right),$$

which we should recognize as the midpoint rule. The error associated with this
quadrature rule is therefore

$$\frac{(b - a)^3}{24} f''(\xi),$$

where $a < \xi < b$.

For determining Gaussian quadrature rules with $n > 1$, it is to our advantage
to replace the general integration interval $[a, b]$ with a standardized interval, the
most common choice for which is $[-1, 1]$. With such an interval we can exploit
the symmetries in the problem to simplify the solution of the nonlinear system of
equations for the abscissas and weights. The conversion from the integral

$$\int_a^b f(x) \, dx$$

to an integral of the form

$$\int_{-1}^{1} F(t) \, dt$$

is most easily accomplished by the change of variable

$$x = \frac{b - a}{2} t + \frac{a + b}{2}.$$

This formula comes from the equation of the line passing through the points $(-1, a)$
and $(1, b)$, where the first coordinate is the $t$-coordinate and the second is the
$x$-coordinate. The resulting relationship between the two integrals is then

$$\int_a^b f(x) \, dx = \frac{b - a}{2} \int_{-1}^{1} f\left(\frac{b - a}{2} t + \frac{a + b}{2}\right) dt.$$


Let's now construct the two-point Gaussian quadrature rule

$$\int_{-1}^{1} f(x) \, dx \approx w_1 f(x_1) + w_2 f(x_2).$$

Since this formula is to have degree of precision equal to $2(2) - 1 = 3$, the weights
and abscissas must satisfy

$$f(x) = 1: \quad w_1 + w_2 = \int_{-1}^{1} dx = 2$$
$$f(x) = x: \quad w_1 x_1 + w_2 x_2 = \int_{-1}^{1} x \, dx = 0$$
$$f(x) = x^2: \quad w_1 x_1^2 + w_2 x_2^2 = \int_{-1}^{1} x^2 \, dx = \tfrac{2}{3}$$
$$f(x) = x^3: \quad w_1 x_1^3 + w_2 x_2^3 = \int_{-1}^{1} x^3 \, dx = 0.$$

The symmetry of the integration interval about zero suggests $x_2 = -x_1$ and
$w_1 = w_2$. Substituting these relations into the system, the equations for $f(x) = x$
and $f(x) = x^3$ are satisfied exactly, and the remaining equations take the form
$2w_1 = 2$ and $2w_1 x_1^2 = 2/3$. The solution of the system is then $w_1 = w_2 = 1$,
$x_1 = -\sqrt{1/3}$ and $x_2 = \sqrt{1/3}$, giving the quadrature rule

$$\int_{-1}^{1} f(x) \, dx \approx f\left(-\sqrt{\tfrac{1}{3}}\right) + f\left(\sqrt{\tfrac{1}{3}}\right).$$
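
The degree-of-precision claim for the two-point rule can be checked numerically. The following is a minimal sketch, not part of the original text; the function names `gauss2` and `exact_monomial` are illustrative assumptions. It applies the rule to the monomials $1, x, \ldots, x^4$ and compares against the exact integrals over $[-1, 1]$.

```python
import math

def gauss2(f):
    """Two-point Gaussian rule on [-1, 1]: f(-sqrt(1/3)) + f(sqrt(1/3))."""
    x = math.sqrt(1.0 / 3.0)
    return f(-x) + f(x)

def exact_monomial(k):
    """Exact integral of x**k over [-1, 1] (zero for odd k)."""
    return 0.0 if k % 2 == 1 else 2.0 / (k + 1)

# Error of the rule on x**k for k = 0, 1, 2, 3, 4.
errors = [abs(gauss2(lambda x, k=k: x**k) - exact_monomial(k))
          for k in range(5)]

for k, err in enumerate(errors):
    print(f"x^{k}: error = {err:.3e}")
```

Under these assumptions the errors for degrees 0 through 3 should be at rounding level, while the error for $x^4$ is not zero, which is consistent with a degree of precision of exactly 3.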


Unlike our previous example, this quadrature rule is not one that was developed
in an earlier section, so we will have to derive the error term from scratch. To
accomplish this, first note that if we interpolate the integrand, $f$, at $x_1 = -\sqrt{1/3}$
and $x_2 = \sqrt{1/3}$ and then integrate the resulting interpolating polynomial, we
reproduce the two-point Gaussian quadrature rule. This implies that the error term
associated with the two-point Gaussian quadrature rule is

$$\int_{-1}^{1} f[x_1, x_2, x](x - x_1)(x - x_2) \, dx.$$

From this starting point, we proceed as follows. Since

$$\frac{f[x_1, x_2, x] - f[x_1, x_2, x_1]}{x - x_1} = f[x_1, x_2, x_1, x],$$

we may replace $f[x_1, x_2, x]$ by $f[x_1, x_2, x_1] + f[x_1, x_2, x_1, x](x - x_1)$. This
replacement transforms the error term to

$$\int_{-1}^{1} f[x_1, x_2, x_1](x - x_1)(x - x_2) \, dx + \int_{-1}^{1} f[x_1, x_2, x_1, x](x - x_1)^2 (x - x_2) \, dx.$$

The first of these integrals is equal to zero. In the second integral, we use the
equation

$$\frac{f[x_1, x_2, x_1, x] - f[x_1, x_2, x_1, x_2]}{x - x_2} = f[x_1, x_2, x_1, x_2, x]$$

to replace $f[x_1, x_2, x_1, x]$ by $f[x_1, x_2, x_1, x_2] + f[x_1, x_2, x_1, x_2, x](x - x_2)$. Now the
error term takes the form

$$\int_{-1}^{1} f[x_1, x_2, x_1, x_2](x - x_1)^2 (x - x_2) \, dx + \int_{-1}^{1} f[x_1, x_2, x_1, x_2, x](x - x_1)^2 (x - x_2)^2 \, dx.$$

The first of these integrals is again equal to zero. Finally, an application of the
Weighted Mean-Value Theorem for Integrals to the second integral leads to

$$\int_{-1}^{1} f(x) \, dx = f\left(-\sqrt{\tfrac{1}{3}}\right) + f\left(\sqrt{\tfrac{1}{3}}\right) + \frac{1}{135} f^{(4)}(\xi),$$

where $-1 < \xi < 1$.
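Converting this rule back to the more general integration interval $[a, b]$ produces

$$\int_a^b f(x) \, dx = \frac{b - a}{2} \left[ f\left(\frac{a + b}{2} - \frac{1}{\sqrt{3}} \frac{b - a}{2}\right) + f\left(\frac{a + b}{2} + \frac{1}{\sqrt{3}} \frac{b - a}{2}\right) + \frac{1}{135} \frac{d^4 f}{dt^4}(\hat{\xi}) \right]$$
$$= \frac{b - a}{2} \left[ f\left(\frac{a + b}{2} - \frac{1}{\sqrt{3}} \frac{b - a}{2}\right) + f\left(\frac{a + b}{2} + \frac{1}{\sqrt{3}} \frac{b - a}{2}\right) \right] + \frac{(b - a)^5}{4320} \frac{d^4 f}{dx^4}(\xi),$$

where $a < \xi < b$ and, in the last line, the chain rule has been used to convert
derivatives with respect to $t$ in the error term to derivatives with respect to $x$:

$$\frac{d}{dt} = \frac{d}{dx} \frac{dx}{dt} = \frac{b - a}{2} \frac{d}{dx} \quad \Rightarrow \quad \frac{d^4}{dt^4} = \left(\frac{b - a}{2}\right)^4 \frac{d^4}{dx^4}.$$

EXAMPLE 6.14 Approximating ln(2)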


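One way to approximate the value of ln(2) is to approximate the value of the integral

$$\int_1^2 \frac{1}{x} \, dx.$$

Using the two-point Gaussian quadrature rule and noting that for this problem
$a = 1$, $b = 2$, and $f(x) = 1/x$, we obtain the approximation

$$\int_1^2 \frac{1}{x} \, dx \approx \frac{2 - 1}{2} \left[ f\left(\frac{2 + 1}{2} - \frac{1}{\sqrt{3}} \frac{2 - 1}{2}\right) + f\left(\frac{2 + 1}{2} + \frac{1}{\sqrt{3}} \frac{2 - 1}{2}\right) \right]$$
$$= \frac{1}{2} \left[ \left(\frac{3}{2} - \frac{\sqrt{3}}{6}\right)^{-1} + \left(\frac{3}{2} + \frac{\sqrt{3}}{6}\right)^{-1} \right]$$
$$= 0.6923076923.$$

Even with only two function evaluations, the absolute error in this approximation
is $8.394 \times 10^{-4}$.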
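
As a small illustration of the rule on a general interval, the sketch below applies the two-point formula to $f(x) = 1/x$ on $[1, 2]$. The function name `gauss2_ab` is an assumption made for this example, not something taken from the text.

```python
import math

def gauss2_ab(f, a, b):
    """Two-point Gaussian rule mapped from [-1, 1] to [a, b]."""
    half = 0.5 * (b - a)            # (b - a)/2, the Jacobian of the map
    mid = 0.5 * (a + b)             # image of t = 0
    shift = half / math.sqrt(3.0)   # image of t = 1/sqrt(3), measured from mid
    return half * (f(mid - shift) + f(mid + shift))

approx = gauss2_ab(lambda x: 1.0 / x, 1.0, 2.0)
error = abs(math.log(2.0) - approx)
print(f"approximation = {approx:.10f}, absolute error = {error:.3e}")
```

The computed approximation can be compared directly with the value and the absolute error quoted in the example.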


If greater accuracy is required, we can develop a formula with a higher degree
of precision, or we can work in a composite manner, dividing the interval $[a, b]$
into subintervals and applying the lower accuracy formula on each subinterval. In
Exercise 2 of Section 6.5, the composite midpoint rule,

$$\int_a^b f(x) \, dx = 2h \sum_{j=1}^{n} f(c_j) + \frac{(b - a) h^2}{6} f''(\xi),$$

where $h = (b - a)/(2n)$, $c_j = a + (2j - 1)h$ and $a < \xi < b$, was considered. For
the two-point Gaussian quadrature rule, let $h = (b - a)/n$ and $x_j = a + jh$ for
$j = 1, 2, 3, \ldots, n$. Applying the basic quadrature rule on each subinterval $[x_{j-1}, x_j]$
yields

$$\int_a^b f(x) \, dx = \frac{h}{2} \sum_{j=1}^{n} \left[ f\left(x_j - \frac{h}{2} - \frac{1}{\sqrt{3}} \frac{h}{2}\right) + f\left(x_j - \frac{h}{2} + \frac{1}{\sqrt{3}} \frac{h}{2}\right) \right] + \frac{(b - a) h^4}{4320} f^{(4)}(\xi),$$

where $a < \xi < b$. Details of this derivation are left as an exercise.

EXAMPLE 6.15 Approximating the Value of $\pi$


In the previous section, the composite Simpson's rule was used to approximate the
value of $\pi$ to four decimal places by estimating the value of the integral

$$\int_0^1 \frac{1}{1 + x^2} \, dx$$

to within 0.0000125. A total of $n = 12$ subintervals were required to guarantee the
desired approximation error, and the resulting approximations for the value of the
integral and for $\pi$ were

$$\int_0^1 \frac{1}{1 + x^2} \, dx \approx 0.78539816007634$$

and

$$\pi \approx 3.14159264030538.$$

The absolute error in this approximation to $\pi$ is $1.3284 \times 10^{-8}$. For future reference,
note that 12 subintervals for Simpson's rule corresponds to 13 function evaluations.

For the composite two-point Gaussian quadrature rule, $h = (b - a)/n$, so the
error term can be written as

$$\frac{(b - a) h^4}{4320} f^{(4)}(\xi) = \frac{(b - a)^5}{4320 n^4} f^{(4)}(\xi).$$

Since $\max_{x \in [0,1]} |f^{(4)}(x)| = 24$, it follows that $n$ must be selected to satisfy the
inequality

$$\frac{(1 - 0)^5}{4320 n^4} \cdot 24 \le 1.25 \times 10^{-5}$$

in order to guarantee an estimate for the integral with an absolute error no larger
than 0.0000125. The solution of this inequality is $n \ge 4.59$; therefore, we will use
$n = 5$. With $n = 5$,

$$\int_0^1 \frac{1}{1 + x^2} \, dx \approx 0.78539817044636 \quad \Rightarrow \quad \pi \approx 3.14159268178543.$$

The absolute error in this approximation to $\pi$ is $2.8196 \times 10^{-8}$, which is slightly
larger than that obtained using Simpson's rule. However, with five subintervals,
the composite two-point Gaussian quadrature rule uses only 10 function evaluations
(two per subinterval), compared to the 13 used by Simpson's rule.
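
The composite rule used in this example can be sketched in a few lines. This is an illustrative implementation under the assumption that the subintervals are uniform; the name `composite_gauss2` is hypothetical and does not come from the text.

```python
import math

def composite_gauss2(f, a, b, n):
    """Composite two-point Gaussian rule with n equal subintervals of [a, b]."""
    h = (b - a) / n
    shift = 0.5 * h / math.sqrt(3.0)   # offset of each node from the midpoint
    total = 0.0
    for j in range(1, n + 1):
        mid = a + (j - 0.5) * h        # midpoint of [x_{j-1}, x_j]
        total += f(mid - shift) + f(mid + shift)
    return 0.5 * h * total

approx = composite_gauss2(lambda x: 1.0 / (1.0 + x * x), 0.0, 1.0, 5)
pi_error = abs(4.0 * approx - math.pi)
print(f"integral = {approx:.14f}, error in pi = {pi_error:.4e}")

# The basic rule is exact for cubics, so the composite rule is as well.
cubic = composite_gauss2(lambda x: x**3, 0.0, 2.0, 3)
```

With `n = 5` the call uses 10 function evaluations, matching the count given in the example; the printed values can be compared with those reported above.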


(The remainder of this section may be omitted without loss of continuity.)

Theoretical Development


Although the method of undetermined coefficients is conceptually straightforward, 
it tends to give the impression that there is no relationship among Gaussian quadra- 
ture rules. This, however, is not the case. There is an underlying theory which 
connects all Gaussian quadrature rules. 

To develop this unifying theory, we will consider the more general class of
integrals of the form

$$\int_a^b f(x) w(x) \, dx.$$

The function $w$ is known as a weight function.

Definition. The function $w$ is called a WEIGHT FUNCTION on the interval
$[a, b]$ if it satisfies the three properties:

(1) $w$ is integrable on $[a, b]$; i.e., $\int_a^b w(x) \, dx$ exists;
(2) $w(x) \ge 0$ for all $x \in [a, b]$; and
(3) $w(x) = 0$ at isolated points only; that is, there is no open interval
    $(x_1, x_2) \subset [a, b]$ such that $w(x) = 0$ for all $x \in (x_1, x_2)$.

The weight functions (and corresponding intervals) most commonly encountered
in practice are $w(x) = 1$ on $[-1, 1]$, $w(x) = e^{-x}$ on $[0, \infty)$, $w(x) = e^{-x^2}$ on
$(-\infty, \infty)$, and $w(x) = (1 - x^2)^{-1/2}$ on $[-1, 1]$.

Associated with each weight function/integration interval pair is a special
family of polynomials, unique up to a normalization factor. These functions lie
at the heart of Gaussian quadrature. For a given $n$, the family consists of $n + 1$
polynomials $\phi_0, \phi_1, \phi_2, \ldots, \phi_n$, where the degree of each $\phi_k$ is equal to $k$. Hence,
the family consists of one constant function, one linear, one quadratic and so on.
The most important property of these functions is ORTHOGONALITY:

$$\int_a^b \phi_i(x) \phi_j(x) w(x) \, dx = 0$$

whenever $i \ne j$. The families of orthogonal polynomials which correspond to the
specific weight functions mentioned above are summarized in Table 6.4. For an
arbitrary weight function, the corresponding family of orthogonal polynomials can
be generated by applying the Gram-Schmidt process (see any text on linear algebra)
to the polynomials $1, x, x^2, \ldots, x^n$.

We are now in a position to establish the main result regarding Gaussian
quadrature. To simplify notation, let $\Pi_n$ = {all polynomials of degree $\le n$}. In
the proof of the following theorem, we will need this basic fact from linear algebra:
if $B = \{\phi_0, \phi_1, \phi_2, \ldots, \phi_n\} \subset \Pi_n$ is an orthogonal family, then $B$ forms a basis
for $\Pi_n$. In other words, for any polynomial $p \in \Pi_n$ there exist unique constants
$c_0, c_1, c_2, \ldots, c_n$ such that $p$ can be expressed as

$$p = c_0 \phi_0 + c_1 \phi_1 + c_2 \phi_2 + \cdots + c_n \phi_n = \sum_{i=0}^{n} c_i \phi_i.$$

$w(x) = 1$ on $[-1, 1]$: Legendre polynomials $P_n(x)$
    $P_0(x) = 1$, $P_1(x) = x$, $P_2(x) = (3x^2 - 1)/2$, $P_3(x) = (5x^3 - 3x)/2$
    Recurrence relation: $P_{n+1}(x) = \frac{2n+1}{n+1} x P_n(x) - \frac{n}{n+1} P_{n-1}(x)$

$w(x) = e^{-x}$ on $[0, \infty)$: Laguerre polynomials $L_n(x)$
    $L_0(x) = 1$, $L_1(x) = 1 - x$, $L_2(x) = x^2 - 4x + 2$, $L_3(x) = -x^3 + 9x^2 - 18x + 6$
    Recurrence relation: $L_{n+1}(x) = (1 + 2n - x) L_n(x) - n^2 L_{n-1}(x)$

$w(x) = e^{-x^2}$ on $(-\infty, \infty)$: Hermite polynomials $H_n(x)$
    $H_0(x) = 1$, $H_1(x) = 2x$, $H_2(x) = 4x^2 - 2$, $H_3(x) = 8x^3 - 12x$
    Recurrence relation: $H_{n+1}(x) = 2x H_n(x) - 2n H_{n-1}(x)$

$w(x) = (1 - x^2)^{-1/2}$ on $[-1, 1]$: Chebyshev polynomials $T_n(x)$
    $T_0(x) = 1$, $T_1(x) = x$, $T_2(x) = 2x^2 - 1$, $T_3(x) = 4x^3 - 3x$
    Recurrence relation: $T_{n+1}(x) = 2x T_n(x) - T_{n-1}(x)$

TABLE 6.4: Common Families of Orthogonal Polynomials
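
The recurrence relations in Table 6.4 translate directly into code. As a hedged sketch (the function name `legendre` is an assumption for illustration), the three-term recurrence for the Legendre polynomials can be evaluated as follows and compared against the closed forms listed in the table.

```python
def legendre(n, x):
    """Evaluate P_n(x) via P_{k+1} = ((2k+1) x P_k - k P_{k-1}) / (k+1)."""
    p_prev, p = 1.0, x          # P_0 and P_1
    if n == 0:
        return p_prev
    for k in range(1, n):
        p_prev, p = p, ((2 * k + 1) * x * p - k * p_prev) / (k + 1)
    return p

x = 0.5
print(legendre(2, x), (3 * x**2 - 1) / 2)      # both evaluate P_2(0.5)
print(legendre(3, x), (5 * x**3 - 3 * x) / 2)  # both evaluate P_3(0.5)
```

The same pattern applies to the other three families; only the starting pair and the recurrence coefficients change.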


Theorem (Gaussian Quadrature). Let $w$ be a weight function on $[a, b]$,
let $n$ be a positive integer and let $\{\phi_0, \phi_1, \phi_2, \ldots, \phi_n\} \subset \Pi_n$ be an orthogonal
family with degree of $\phi_k = k$ for each $k$. Let $x_1, x_2, x_3, \ldots, x_n$ be the roots
of $\phi_n(x)$ and define

$$L_i(x) = \prod_{j=1, \, j \ne i}^{n} \frac{x - x_j}{x_i - x_j} \quad \text{for each } i.$$

Then the corresponding Gaussian quadrature formula is given by

$$I(f) = \int_a^b f(x) w(x) \, dx \approx I_n(f) = \sum_{i=1}^{n} w_i f(x_i),$$

where $w_i = \int_a^b L_i(x) w(x) \, dx$. The formula $I_n(f)$ has degree of precision $2n - 1$.

Proof. The only thing that needs to be proven is that the quadrature rule

$$I_n(f) = \sum_{i=1}^{n} w_i f(x_i),$$

where $w_i = \int_a^b L_i(x) w(x) \, dx$, has degree of precision equal to $2n - 1$. We will
establish this result in three steps. First, we will show that the quadrature
rule has degree of precision at least $n - 1$, then that it has degree of precision
at least $2n - 1$, and finally that the degree of precision is exactly $2n - 1$.

(a) Degree of Precision $\ge n - 1$

Let $g \in \Pi_{n-1}$. Note that the functions $L_i(x)$ defined in the statement of the
theorem are the Lagrange polynomials associated with the roots of the polynomial
$\phi_n(x)$. The function $\sum_{i=1}^{n} g(x_i) L_i(x)$ is therefore the Lagrange form
of the interpolating polynomial that interpolates $g$ at the roots of $\phi_n$. Since
$g$ is itself a polynomial of degree at most $n - 1$, it follows from the uniqueness
of the interpolating polynomial that $g(x) = \sum_{i=1}^{n} g(x_i) L_i(x)$ for all $x$.

Then

$$I(g) = \int_a^b g(x) w(x) \, dx = \sum_{i=1}^{n} g(x_i) \int_a^b L_i(x) w(x) \, dx = \sum_{i=1}^{n} w_i g(x_i) = I_n(g).$$

$I_n$ therefore produces the exact integral for any polynomial of degree up to and
including $n - 1$, so, by definition, the quadrature rule has degree of precision
at least equal to $n - 1$.


(b) Degree of Precision $\ge 2n - 1$
Let $p \in \Pi_{2n-1}$ and divide $p$ by $\phi_n$ to obtain

$$p(x) = q(x) \phi_n(x) + r(x),$$

where $q, r \in \Pi_{n-1}$. Since the polynomials $\{\phi_0, \phi_1, \phi_2, \ldots, \phi_{n-1}\}$ form a basis
for $\Pi_{n-1}$ and $q \in \Pi_{n-1}$, there exist constants $c_0, c_1, c_2, \ldots, c_{n-1}$ such that
$q = \sum_{i=0}^{n-1} c_i \phi_i$. Then

$$I(p) = I(q \phi_n + r)$$
$$= \sum_{i=0}^{n-1} c_i \int_a^b \phi_i(x) \phi_n(x) w(x) \, dx + I(r)$$
$$= 0 + I(r) = I(r),$$

where the summation term vanishes due to orthogonality. Now, $r \in \Pi_{n-1}$ so
that $I(r) = I_n(r)$ by part (a). Since the $x_i$ are roots of $\phi_n$, it follows that

$$p(x_i) = q(x_i) \phi_n(x_i) + r(x_i) = 0 + r(x_i) = r(x_i),$$

which implies $I_n(r) = I_n(p)$. Combining these results yields $I(p) = I_n(p)$.

(c) Degree of Precision $= 2n - 1$
To establish that the degree of precision is exactly $2n - 1$, it is only necessary
to find one polynomial of degree $2n$ for which $I_n(p) \ne I(p)$. Consider the
polynomial $\phi_n^2(x)$, which has degree $2n$. Note that

$$I(\phi_n^2(x)) = \int_a^b \phi_n^2(x) w(x) \, dx > 0$$

but

$$I_n(\phi_n^2(x)) = \sum_{i=1}^{n} w_i \phi_n^2(x_i) = 0.$$

Remarks. (1) For a given number of quadrature points $n$, weight function $w$,
and integration interval $[a, b]$, the quadrature rule $I_n(f) = \sum_{i=1}^{n} w_i f(x_i)$ is
unique in the sense that it is the only quadrature rule based on the weight
function $w(x)$ with degree of precision equal to $2n - 1$.

(2) If $f$ has $2n$ continuous derivatives, then there exists $\xi \in [a, b]$ such that

$$I(f) = I_n(f) + \frac{\sigma_n}{a_n^2 (2n)!} f^{(2n)}(\xi),$$

where $\sigma_n = \int_a^b \phi_n^2(x) w(x) \, dx$ and $a_n$ is the leading coefficient of $\phi_n(x)$ (see
Exercises 31 and 32).
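
To see the degree-of-precision statement of the theorem in action for $n = 3$ and $w(x) = 1$ on $[-1, 1]$, the sketch below uses the three-point Gauss-Legendre abscissas $0, \pm\sqrt{3/5}$ and weights $8/9, 5/9$ quoted in Exercise 5 of this section. The names `NODES`, `WEIGHTS`, `gauss3`, and `exact_monomial` are illustrative assumptions.

```python
import math

# Three-point Gauss-Legendre abscissas (roots of P_3) and weights.
NODES = (-math.sqrt(3.0 / 5.0), 0.0, math.sqrt(3.0 / 5.0))
WEIGHTS = (5.0 / 9.0, 8.0 / 9.0, 5.0 / 9.0)

def gauss3(f):
    """Three-point Gaussian rule on [-1, 1]."""
    return sum(w * f(x) for w, x in zip(WEIGHTS, NODES))

def exact_monomial(k):
    """Exact integral of x**k over [-1, 1]."""
    return 0.0 if k % 2 else 2.0 / (k + 1)

# The rule should be exact through degree 2n - 1 = 5 and fail at degree 6.
errors = [abs(gauss3(lambda x, k=k: x**k) - exact_monomial(k))
          for k in range(7)]
for k, err in enumerate(errors):
    print(f"degree {k}: error = {err:.3e}")
```

The nonzero error at degree 6 corresponds to part (c) of the proof: a polynomial of degree $2n$ exists for which the rule is not exact.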


In establishing the previous theorem, we have overlooked one important item.
The proof implicitly assumes that $\phi_n$ has $n$ simple roots, all of which lie inside the
interval $(a, b)$. The next theorem establishes that this is true.

Theorem. If $\{\phi_0, \phi_1, \phi_2, \ldots, \phi_n\}$ is a set of polynomials with the following
properties:

1. $\phi_k$ has degree $k$ for each $k$, and

2. the set is orthogonal with respect to the weight function $w(x)$ on $[a, b]$,

then for each $k$, $\phi_k$ has precisely $k$ real roots, which are all simple and all lie
in $(a, b)$.


Proof. The proof of this theorem will proceed in two steps. First we will
establish the existence of real roots for $\phi_k$, and then we will count those
roots.

(a) Existence of Real Roots
Since the degree of $\phi_0$ is zero, it follows that $\phi_0 = \alpha$ for some nonzero constant
$\alpha$. Then, for $k > 0$,

$$0 = \int_a^b \phi_0(x) \phi_k(x) w(x) \, dx = \alpha \int_a^b \phi_k(x) w(x) \, dx,$$

where the first equality holds due to orthogonality. However, since $w(x) \ge 0$ on
$[a, b]$ and the integral is equal to zero, $\phi_k$ must change sign in $(a, b)$.
Since $\phi_k$ is a continuous function on $(a, b)$, the Intermediate Value Theorem
guarantees the existence of a real root somewhere on $(a, b)$.
(b) Count Roots
Suppose $\phi_k$ changes sign at exactly $j$ points $r_1, r_2, r_3, \ldots, r_j$ in $(a, b)$ such
that

$$a < r_1 < r_2 < r_3 < \cdots < r_j < b.$$

Without loss of generality, assume that $\phi_k(x) > 0$ on $(a, r_1)$. Then $\phi_k$ alternates
sign on $(r_1, r_2)$, $(r_2, r_3)$, $(r_3, r_4)$, ..., $(r_j, b)$. Define the auxiliary function

$$p(x) = (-1)^j \prod_{i=1}^{j} (x - r_i).$$

By construction, $p(x)$ and $\phi_k(x)$ have the same sign for all $x \in [a, b]$. From
this it follows that

$$\int_a^b p(x) \phi_k(x) w(x) \, dx > 0. \qquad (1)$$

Suppose that $j < k$. Since $\{\phi_0, \phi_1, \phi_2, \ldots, \phi_j\}$ forms a basis for $\Pi_j$, there
exist constants $c_0, c_1, c_2, \ldots, c_j$ such that $p(x) = \sum_{i=0}^{j} c_i \phi_i(x)$. Substituting
this expression into (1) yields

$$\int_a^b p(x) \phi_k(x) w(x) \, dx = \int_a^b \sum_{i=0}^{j} c_i \phi_i(x) \phi_k(x) w(x) \, dx = \sum_{i=0}^{j} c_i \int_a^b \phi_i(x) \phi_k(x) w(x) \, dx = 0,$$

where the final equality follows from the assumption that $j < k$ and from
orthogonality. We have therefore arrived at a contradiction, which implies
that $j \ge k$.

However, each of $r_1, r_2, r_3, \ldots, r_j$ is a root of $\phi_k$, and $\phi_k$ has degree $k$, which
means $\phi_k$ can have at most $k$ roots. Therefore, we must have $j = k$, so that
$\phi_k$ has $k$ simple real roots in $(a, b)$. $\Box$
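
The conclusion of the theorem can be observed numerically for the Legendre family. The following sketch, which is only an illustration and not a proof, counts sign changes of $P_k$ on a fine grid strictly inside $(-1, 1)$; the helper names `legendre` and `sign_changes` and the grid size are assumptions made for this example.

```python
def legendre(n, x):
    """Evaluate P_n(x) with the three-term recurrence from Table 6.4."""
    p_prev, p = 1.0, x
    if n == 0:
        return p_prev
    for k in range(1, n):
        p_prev, p = p, ((2 * k + 1) * x * p - k * p_prev) / (k + 1)
    return p

def sign_changes(n, samples=2000):
    """Count sign changes of P_n on cell midpoints of a uniform grid in (-1, 1)."""
    xs = [-1.0 + (i + 0.5) * 2.0 / samples for i in range(samples)]
    vals = [legendre(n, x) for x in xs]
    return sum(1 for u, v in zip(vals, vals[1:]) if u * v < 0.0)

counts = [sign_changes(k) for k in range(1, 7)]
print(counts)  # one entry per degree k = 1, ..., 6
```

Each sign change marks a simple root inside the interval, so the count for $P_k$ is expected to equal $k$.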


EXERCISES 
1. Approximate the value of each of the following integrals using the two-point 
Gaussian quadrature rule (the basic formula, not the composite rule). Verify 
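that the theoretical error bound holds in each case.
(a) fie"? de (b) fo rpae de
(c) fo sinadz (d) ie tan} ¢ da
2. Derive the composite two-point Gaussian quadrature rule:

$$\int_a^b f(x) \, dx = \frac{h}{2} \sum_{j=1}^{n} \left[ f\left(x_j - \frac{h}{2} - \frac{1}{\sqrt{3}} \frac{h}{2}\right) + f\left(x_j - \frac{h}{2} + \frac{1}{\sqrt{3}} \frac{h}{2}\right) \right] + \frac{(b - a) h^4}{4320} f^{(4)}(\xi),$$

where $h = (b - a)/n$, $x_j = a + jh$ and $a < \xi < b$.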
3. Approximate the value of each of the following integrals using the composite 


two-point Gaussian quadrature rule with the specified number of subintervals. 
Verify that the theoretical error bound holds in each case. 


(a) (on en dz, n=2 (b) fs Td dz, n=2 
(c) Ip sinzgdz, n=3 (d) Hie tan tadz, n=3 
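4. Let $x_1 = -\sqrt{1/3}$ and $x_2 = \sqrt{1/3}$. Show that

(a) $\int_{-1}^{1} f[x_1, x_2, x_1](x - x_1)(x - x_2) \, dx = 0$;
(b) $\int_{-1}^{1} f[x_1, x_2, x_1, x_2](x - x_1)^2 (x - x_2) \, dx = 0$; and
(c) $\int_{-1}^{1} f[x_1, x_2, x_1, x_2, x](x - x_1)^2 (x - x_2)^2 \, dx = \frac{1}{135} f^{(4)}(\xi)$, where $-1 < \xi < 1$.

5. (a) Derive the three-point Gaussian quadrature rule

$$\int_{-1}^{1} f(x) \, dx = \frac{5}{9} f\left(-\sqrt{\tfrac{3}{5}}\right) + \frac{8}{9} f(0) + \frac{5}{9} f\left(\sqrt{\tfrac{3}{5}}\right) + \frac{1}{15750} f^{(6)}(\xi),$$

where $-1 < \xi < 1$.

(b) Convert the quadrature rule from part (a) to the general integration interval
$[a, b]$.

(c) Derive the composite three-point Gaussian quadrature rule. [Note: The rate
of convergence should be $O(h^6)$.]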

6. Use the three-point Gaussian quadrature rule to approximate the value of the 
definite integral i i de. What is the absolute error in this approximation? 

7. Repeat Exercise 1 using the three-point Gaussian quadrature rule. 

8. Repeat Exercise 3 using the composite three-point Gaussian quadrature rule. 


In Exercises 9-16, verify that the composite two-point Gaussian quadrature rule has
rate of convergence $O(h^4)$ and the composite three-point Gaussian quadrature rule has
rate of convergence $O(h^6)$ by approximating the value of the indicated definite integral.


9, ee Vi+ 23 dx 
10. {5 sin dx 
1. fetde 

12) p e* dx 
13. ie tan”! eda
14. f? $22 do


15 


1 1 
So Fag 


16. fo ava + 9dr 


In Exercises 17-24, approximate the value of the indicated definite integral using the 
composite two-point Gaussian quadrature rule and the composite three-point Gaussian 


quadrature rule. For each method, use the smallest value of n which will guarantee an 
absolute error of no greater than § x 10-°, 


17 


18. 


19 


20. 
21. 
22. 
23. 
24, 
25. 


26. 
a7. 


2 
i hae 
i e 7 dr 
: te tan b adr 


ff ape 

rh 7s da 
fe wa? +9 de 
Jo Vite de 
£ en” dx 


Consider the definite integral 2 sin(,/7z) dz. Numerically determine the rate 

of convergence of the composite two-point Gaussian quadrature rule for each of 

the following integration intervals. 

(a) $[a, b] = [0, 2]$

(b) $[a, b] = [\pi/4, 9\pi/4]$

(c) $[a, b] = [\pi, 2\pi]$

(d) Explain any variation among the rates of convergence obtained in parts (a),
(b), and (c).

Repeat Exercise 25 for the composite three-point Gaussian quadrature rule. 


Consider the definite integral $\int_a^b x^2 e^{-x} \, dx$. Numerically determine the rate of
convergence of the composite two-point Gaussian quadrature rule for each of the
following integration intervals.

(a) $[a, b] = [0, 2]$

(b) $[a, b] = [3 - \sqrt{3}, 3 + \sqrt{3}]$

(c) $[a, b] = [-1, 4]$

(d) Explain any variation among the rates of convergence obtained in parts (a),
(b), and (c).


Optional Material 


28. (a) Find the abscissas, $x_i$, and the weights, $w_i$, of the three-point Gauss-Hermite
quadrature formula

$$\int_{-\infty}^{\infty} e^{-x^2} f(x) \, dx \approx w_1 f(x_1) + w_2 f(x_2) + w_3 f(x_3).$$

Use the fact that the Hermite polynomials, $H_n(x)$, are orthogonal in the
corresponding inner product

$$\langle f, g \rangle = \int_{-\infty}^{\infty} e^{-x^2} f(x) g(x) \, dx$$

and that $H_3(x) = 8x^3 - 12x$. Find the weights by undetermined coefficients
using the values

$$\int_{-\infty}^{\infty} e^{-x^2} \, dx = \sqrt{\pi}, \qquad \int_{-\infty}^{\infty} x e^{-x^2} \, dx = 0, \qquad \int_{-\infty}^{\infty} x^2 e^{-x^2} \, dx = \frac{\sqrt{\pi}}{2}.$$

(b) Use your results from part (a) to evaluate both


2 ee a 1 
i —— de and / Ty ae. 
ph he a Pe 


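29. (a) Find the abscissas and the weights of the three-point Gauss-Chebyshev
quadrature formula

$$\int_{-1}^{1} \frac{f(x)}{\sqrt{1 - x^2}} \, dx \approx w_1 f(x_1) + w_2 f(x_2) + w_3 f(x_3).$$

Use the fact that the Chebyshev polynomials, $T_n(x)$, are orthogonal in the
corresponding inner product

$$\langle f, g \rangle = \int_{-1}^{1} \frac{f(x) g(x)}{\sqrt{1 - x^2}} \, dx$$

and that $T_3(x) = 4x^3 - 3x$. Find the weights by undetermined coefficients
using the values

$$\int_{-1}^{1} \frac{1}{\sqrt{1 - x^2}} \, dx = \pi, \qquad \int_{-1}^{1} \frac{x}{\sqrt{1 - x^2}} \, dx = 0, \qquad \int_{-1}^{1} \frac{x^2}{\sqrt{1 - x^2}} \, dx = \frac{\pi}{2}.$$

(b) Use your results from part (a) to evaluate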


| cost ae 

-1V l-<z 2 ; 

30. (a) Find the abscissas and the weights of the three-point Gauss-Laguerre
quadrature formula

$$\int_0^{\infty} e^{-x} f(x) \, dx \approx w_1 f(x_1) + w_2 f(x_2) + w_3 f(x_3).$$

Use the fact that the Laguerre polynomials, $L_n(x)$, are orthogonal in the
corresponding inner product

$$\langle f, g \rangle = \int_0^{\infty} e^{-x} f(x) g(x) \, dx$$

and that $L_3(x) = -x^3 + 9x^2 - 18x + 6$. Find the weights by undetermined
coefficients using the values

(b) Use your results from part (a) to evaluate both 


OO et oO 1 
pacers d =e 
| 1+22 ci [ 1+? sa 


31. Let $w$ be a weight function on $[a, b]$, let $\{\phi_0, \phi_1, \phi_2, \ldots, \phi_n\} \subset \Pi_n$ be an
orthogonal family with respect to $w$ with degree of $\phi_k = k$ for each $k$ and let
$x_1, x_2, x_3, \ldots, x_n$ be the roots of $\phi_n(x)$. Show that

$$\int_a^b w(x) \prod_{i=1}^{n} (x - x_i) \, dx = 0$$

and

$$\int_a^b w(x) \prod_{i=1}^{k} (x - x_i)^2 \prod_{j=k+1}^{n} (x - x_j) \, dx = 0$$

for $k = 1, 2, 3, \ldots, n - 1$.

32. Let $w$ be a weight function on $[a, b]$, let $n$ be a positive integer, let
$\{\phi_0, \phi_1, \phi_2, \ldots, \phi_n\} \subset \Pi_n$ be an orthogonal family with respect to $w$ with
degree of $\phi_k = k$ for each $k$, and let $I_n(f)$ denote the corresponding Gaussian
quadrature rule for approximating

$$I(f) = \int_a^b f(x) w(x) \, dx.$$

Suppose $f$ has $2n$ continuous derivatives. Show there exists $\xi \in [a, b]$ such that

$$I(f) = I_n(f) + \frac{\sigma_n}{a_n^2 (2n)!} f^{(2n)}(\xi),$$

where $\sigma_n = \int_a^b \phi_n^2(x) w(x) \, dx$ and $a_n$ is the leading coefficient of $\phi_n(x)$.


6.7 ROMBERG INTEGRATION 


Though composite Newton-Cotes formulas and composite Gaussian quadrature for- 
mulas come equipped with error terms that can, in principle, be used to determine 
the number of subintervals needed to guarantee a given accuracy, the amount of 
work required to determine the number of subintervals can be greater than the 
amount of work required to compute the final approximation. The error terms also 
tend to provide pessimistic bounds, thereby requiring more work than is actually 
necessary.

In practice, we would prefer to have a quadrature scheme that automatically
determines the amount of work (measured in terms of the number of function 
evaluations of the integrand) needed to achieve a desired level of accuracy. The 
scheme should also be able to report an estimate of the error in the computed 
approximation. In this section, we study one such scheme known as Romberg 
integration. 


The Basic Process 


Romberg integration is extrapolation applied to the composite trapezoidal rule.
Recall from Section 6.3 that extrapolation is a general procedure which takes two
approximations, computed using different values of some parameter $h$, and combines
them in such a way that the result is a higher-order approximation than the original
values. To apply this process, the power of $h$ which appears in the leading term of
the error of the original approximations must be known. Extrapolation works well
with the composite trapezoidal rule because, by the Euler-Maclaurin summation
formula (Section 6.5),

$$\int_a^b f(x) \, dx = T_h(f) + K_1 h^2 + K_2 h^4 + \cdots + K_n h^{2n} + o(h^{2n})$$

provided the integrand is sufficiently differentiable; that is, provided $f$ is
continuously differentiable $2n$ times on $[a, b]$. Here, the $K_i$'s are constants which are
independent of $h$. Hence, the original approximations are second order, the first
extrapolated values will be fourth order, the next extrapolated values will be sixth
order, and so on.

The conventional notation used for Romberg approximations is $R_{k,j}$, where
the first subscript controls the step size, $h$, the second subscript indicates the level
of extrapolation, and calculations are organized into the usual extrapolation table
as follows:

$R_{1,1}$
$R_{2,1}$   $R_{2,2}$
$R_{3,1}$   $R_{3,2}$   $R_{3,3}$
$R_{4,1}$   $R_{4,2}$   $R_{4,3}$   $R_{4,4}$

The first column of the table contains the composite trapezoidal rule
approximations; that is,

$$R_{k,1} = T_h(f), \quad \text{with } h = \frac{b - a}{2^{k-1}}.$$

Since $R_{k,j}$ is an $O(h^{2j})$ approximation for each $j$ and the step size is cut in half
with each new row, the remaining entries in the table are computed according to
the formula

$$R_{k,j} = \frac{4^{j-1} R_{k,j-1} - R_{k-1,j-1}}{4^{j-1} - 1}.$$

EXAMPLE 6.16 Approximating ln(2)


A five-row Romberg integration table for the value of

$$\int_1^2 \frac{1}{x} \, dx,$$

whose exact value is ln(2), is shown below. A single subinterval was used to
compute the first entry in the first column. The error corresponding to each value in
the integration table appears in the second table. Although the best trapezoidal
rule approximation is correct to just 3 decimal places, the final tabulated value is
accurate to 8 decimal places. Further, note that the errors in the first three columns
drop by factors of roughly 4, 16, and 64 with each new row. This is exactly what
we would expect from second-, fourth-, and sixth-order approximations.

Romberg Integration Extrapolation Table

0.7500000000
0.7083333333  0.6944444444
0.6970238095  0.6932539683  0.6931746032
0.6941218504  0.6931545307  0.6931479015  0.6931474776
0.6933912022  0.6931476528  0.6931471943  0.6931471831  0.6931471819

Corresponding Errors

5.6853e-2
1.5186e-2  1.2973e-3
3.8766e-3  1.0679e-4  2.7423e-5
9.7467e-4  7.3501e-6  7.2092e-7  2.9708e-7
2.4402e-4  4.7226e-7  1.3737e-8  2.5120e-9  1.3568e-9
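
The extrapolation table can be generated with a short routine. The sketch below is a straightforward, deliberately unoptimized illustration: it recomputes each trapezoidal value from scratch, and the names `trapezoid` and `romberg_table` are assumptions introduced for this example. Rows and columns are zero-based in the code, so `R[k][j]` corresponds to $R_{k+1,j+1}$ in the text.

```python
import math

def trapezoid(f, a, b, n):
    """Composite trapezoidal rule with n equal subintervals."""
    h = (b - a) / n
    interior = sum(f(a + i * h) for i in range(1, n))
    return h * (0.5 * f(a) + interior + 0.5 * f(b))

def romberg_table(f, a, b, rows):
    """Build a Romberg table; R[k][j] is R_{k+1, j+1} in the text's notation."""
    R = []
    for k in range(rows):
        row = [trapezoid(f, a, b, 2**k)]       # first column: step size halves
        for j in range(1, k + 1):
            factor = 4**j                       # 4^(j-1) with one-based j
            row.append((factor * row[j - 1] - R[k - 1][j - 1]) / (factor - 1))
        R.append(row)
    return R

table = romberg_table(lambda x: 1.0 / x, 1.0, 2.0, 5)
for row in table:
    print("  ".join(f"{value:.10f}" for value in row))
print(f"error in final entry: {abs(table[4][4] - math.log(2.0)):.4e}")
```

The printed rows can be compared entry by entry with the extrapolation table in the example.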


Efficient Calculation of First Column 


When computing the first column of the Romberg integration table, it is not
necessary to start each trapezoidal rule calculation from scratch. In fact, all of the
information used to compute one trapezoidal rule approximation can be reused in
the calculation of the approximation with half the step size. To see how this works,
suppose that $T_h(f)$ has been computed, where $h = (b - a)/n$ for some $n$. Then

$$T_{h/2}(f) = \frac{h}{4} \left[ f(a) + 2 \sum_{j=2, \, \text{even}}^{2n-2} f\left(a + j \frac{h}{2}\right) + 2 \sum_{j=1, \, \text{odd}}^{2n-1} f\left(a + j \frac{h}{2}\right) + f(b) \right]$$
$$= \frac{1}{2} \cdot \frac{h}{2} \left[ f(a) + 2 \sum_{k=1}^{n-1} f(a + kh) + f(b) \right] + \frac{h}{2} \sum_{j=1, \, \text{odd}}^{2n-1} f\left(a + j \frac{h}{2}\right)$$
$$= \frac{1}{2} T_h(f) + \frac{h}{2} \sum_{j=1, \, \text{odd}}^{2n-1} f\left(a + j \frac{h}{2}\right).$$

The second term in this last expression consists only of those function evaluations
which were not used in the computation of $T_h(f)$. Working in this fashion, we
can calculate the entire first column of the Romberg integration table for the same
number of function evaluations needed just for the very last entry.
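
The reuse of earlier function values can be expressed as a single refinement step. The following sketch assumes a hypothetical helper `refine_trapezoid` that takes the current trapezoidal value $T_h(f)$ together with the current number of subintervals and returns $T_{h/2}(f)$ using only the new odd-indexed points.

```python
def refine_trapezoid(f, a, b, t_h, n):
    """Given T_h(f) with h = (b - a)/n, return T_{h/2}(f) using n new evaluations."""
    h = (b - a) / n
    # Odd multiples of h/2 are exactly the points not used by T_h.
    new_points = sum(f(a + j * h / 2.0) for j in range(1, 2 * n, 2))
    return 0.5 * t_h + 0.5 * h * new_points

f = lambda x: 1.0 / x
a, b = 1.0, 2.0

t = 0.5 * (b - a) * (f(a) + f(b))   # trapezoidal rule with one subinterval
column = [t]
n = 1
for _ in range(4):
    t = refine_trapezoid(f, a, b, t, n)
    n *= 2
    column.append(t)

for value in column:
    print(f"{value:.10f}")
```

Applied to $\int_1^2 (1/x) \, dx$, the values in `column` should match the first column of the table in Example 6.16, at a total cost of 17 function evaluations.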


Error Estimate 


The Romberg integration table can be computed either in a column-by-column or
in a row-by-row manner. The principal advantage of computing the table one row
at a time is that once a row has been completed, the error in the final value can be
estimated and used to decide whether a new row should be computed. To establish
an error estimate, recall from Section 6.3 that the formula

$$R_{n,n} = \frac{4^{n-1} R_{n,n-1} - R_{n-1,n-1}}{4^{n-1} - 1}$$

arises from

$$R_{n,n} = R_{n-1,n-1} + \frac{4^{n-1}}{4^{n-1} - 1} \left( R_{n,n-1} - R_{n-1,n-1} \right),$$

where the second term on the right is the extrapolation estimate for the leading
term in the error of $R_{n-1,n-1}$. Since $R_{n,n}$ should be a better approximation than
$R_{n-1,n-1}$, we see that the difference $|R_{n,n} - R_{n-1,n-1}|$ can serve as an (albeit crude)
estimate for the error in $R_{n,n}$.

The formula

$$R_{n,n} = \frac{4^{n-1} R_{n,n-1} - R_{n-1,n-1}}{4^{n-1} - 1}$$

can also be rearranged as

$$R_{n,n} = R_{n,n-1} + \frac{1}{4^{n-1} - 1} \left( R_{n,n-1} - R_{n-1,n-1} \right),$$

where the second term on the right is the extrapolation estimate for the leading
term in the error of $R_{n,n-1}$. Therefore, the difference $|R_{n,n} - R_{n,n-1}|$ can also serve
as an estimate for the error in $R_{n,n}$.

Since several approximations and assumptions have been made along the way
to arrive at these error estimates, it is advisable, in practice, to be conservative and
strike some balance between the two estimates. Note that

$$R_{n,n} - R_{n,n-1} = \frac{R_{n,n-1} - R_{n-1,n-1}}{4^{n-1} - 1} = \frac{1}{4^{n-1}} \left( R_{n,n} - R_{n-1,n-1} \right);$$

that is, one of the error estimates is equal to the other divided by $4^{n-1}$. As a compromise,
we will "split the difference" in the denominator and use $|R_{n,n} - R_{n-1,n-1}| / 2^{n-1}$
as an error estimate in the stopping criterion for Romberg integration. In other
words, we will construct the Romberg integration table one row at a time. After
each row has been completed, the condition

$$\frac{|R_{n,n} - R_{n-1,n-1}|}{2^{n-1}} < \epsilon,$$

where $\epsilon$ is a specified convergence tolerance, will be tested. If this test passes, the
algorithm terminates, returning $R_{n,n}$ as the estimate for the value of the integral.
If this test fails, another row in the table is calculated.
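
The row-by-row procedure with this stopping criterion might be sketched as follows. This is one possible arrangement, not the text's own code: the function name `romberg`, the `max_rows` safeguard, and the returned tuple are assumptions made for illustration. Only the previous row of the table is stored, and the first column is updated by reusing earlier function values.

```python
def romberg(f, a, b, tol, max_rows=20):
    """Romberg integration, built one row at a time.

    Returns (estimate, error_estimate, function_evaluations).
    The stopping test is |R[n][n] - R[n-1][n-1]| / 2**(n-1) < tol.
    """
    h = b - a
    prev = [0.5 * h * (f(a) + f(b))]   # first row: one subinterval
    evals = 2
    n = 1                               # subintervals used by prev[0]
    for k in range(1, max_rows):        # k is the zero-based row index
        # Halve the step, evaluating f only at the n new midpoints.
        new_sum = sum(f(a + (2 * i + 1) * h / 2.0) for i in range(n))
        row = [0.5 * prev[0] + 0.5 * h * new_sum]
        evals += n
        n *= 2
        h /= 2.0
        # Extrapolate across the row.
        for j in range(1, k + 1):
            factor = 4**j
            row.append((factor * row[j - 1] - prev[j - 1]) / (factor - 1))
        estimate = abs(row[k] - prev[k - 1]) / 2**k
        if estimate < tol:
            return row[k], estimate, evals
        prev = row
    raise RuntimeError("tolerance not reached within max_rows rows")

value, err_est, evals = romberg(lambda x: 1.0 / (1.0 + x * x), 0.0, 1.0, 5e-5)
print(f"value = {value:.10f}, error estimate = {err_est:.4e}, evaluations = {evals}")
```

The sample call uses the integrand $1/(1 + x^2)$ on $[0, 1]$ with tolerance $5 \times 10^{-5}$, so its output can be checked against the worked example that follows in the text.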
EXAMPLE 6.17 Approximating $\pi/4$

Suppose we wish to use Romberg integration to approximate the value of the definite
integral

$$\int_0^1 \frac{1}{1 + x^2} \, dx,$$

whose exact value is $\pi/4$, to within an absolute error of $5 \times 10^{-5}$. We start out
computing the first two rows of the Romberg integration table, which are

$R_{1,1} = T_1(f) = 0.7500000000$
$R_{2,1} = T_{1/2}(f) = 0.7750000000$, $R_{2,2} = 0.7833333333$

At this point, we estimate the error in $R_{2,2}$ to be

$$\frac{|R_{2,2} - R_{1,1}|}{2} = 0.0166666666.$$

Since this is larger than $\epsilon = 5 \times 10^{-5}$, we compute the next row of the table.
The third row of the Romberg integration table contains

$R_{3,1} = T_{1/4}(f) = 0.7827941176$, $R_{3,2} = 0.7853921569$
and $R_{3,3} = 0.7855294118$. The error estimate associated with $R_{3,3}$ is

$$\frac{|R_{3,3} - R_{2,2}|}{4} = 5.4902 \times 10^{-4},$$

which is still too large. We therefore move on to compute the fourth row of the
integration table.
The fourth row of the Romberg integration table contains

$R_{4,1} = T_{1/8}(f) = 0.7847471236$, $R_{4,2} = 0.7853981256$, $R_{4,3} = 0.7853985235$
and $R_{4,4} = 0.7853964459$. The corresponding estimate for the error in $R_{4,4}$ is

$$\frac{|R_{4,4} - R_{3,3}|}{8} = 1.6621 \times 10^{-5}.$$

Since this error estimate is smaller than $\epsilon = 5 \times 10^{-5}$, we terminate computations.
Our final results are summarized below:

Approximate value of integral: 0.7853964459
Error estimate: 1.6621e-5
Number of function evaluations: 9

Note that the actual error is $1.7175 \times 10^{-6}$.


EXAMPLE 6.18 The Cost of a Warranty 


In Section 6.5, the value of the integral

$$\int_0^1 e^{-x^4} \, dx,$$

which arises in the determination of the cost of a treadwear warranty on tires, was
approximated using the composite trapezoidal rule and the composite Simpson's
rule, with a requirement that the absolute error be no more than $10^{-5}$. The results
obtained were

Trapezoidal rule: $\int_0^1 e^{-x^4} \, dx \approx 0.8448344011$ (error = $4.194 \times 10^{-6}$)
Simpson's rule: $\int_0^1 e^{-x^4} \, dx \approx 0.8448403780$ (error = $1.783 \times 10^{-6}$)

To compute the trapezoidal rule approximation, 172 function evaluations were
needed; 17 function evaluations were needed to obtain the Simpson's rule
approximation.

Using Romberg integration with a convergence tolerance of $10^{-5}$ produces the
results:

Approximate value of integral: 0.844838710518
Error estimate: 8.8728e-07
Number of function evaluations: 17

The actual error in this approximation is $1.1576 \times 10^{-7}$. Hence, with the same
number of function evaluations as the composite Simpson's rule, Romberg integration
produces an error that is one order of magnitude smaller.


Application Problem: Tabulating the Error Function 


Based on our analysis in the Chapter 6 Overview (see page 431), if we tabulate
values of the error function,

$$\operatorname{erf}(x) = \frac{2}{\sqrt{\pi}} \int_0^x e^{-t^2}\,dt,$$

and its first derivative, $\frac{2}{\sqrt{\pi}} e^{-x^2}$, from $x = 0$ to $x = 2$ in increments of $\Delta x = 0.1$, then
we are guaranteed that Hermite cubic interpolation between successive entries in
the table will introduce an error smaller than $5 \times 10^{-6}$. Using Romberg integration
with a tolerance of $5 \times 10^{-5}$ to evaluate $\operatorname{erf}(x)$ for each $x$, we produce the following
table.


  x     erf(x)    (2/√π) e^{-x²}       x     erf(x)    (2/√π) e^{-x²}
 0.0   0.00000      1.12838
 0.1   0.11246      1.11715           1.1   0.88020      0.33648
 0.2   0.22270      1.08413           1.2   0.91031      0.26734
 0.3   0.32863      1.03126           1.3   0.93401      0.20821
 0.4   0.42839      0.96154           1.4   0.95229      0.15894
 0.5   0.52050      0.87878           1.5   0.96611      0.11893
 0.6   0.60386      0.78724           1.6   0.97635      0.08723
 0.7   0.67780      0.69127           1.7   0.98379      0.06271
 0.8   0.74210      0.59499           1.8   0.98909      0.04419
 0.9   0.79691      0.50197           1.9   0.99279      0.03052
 1.0   0.84270      0.41511           2.0   0.99532      0.02067


EXERCISES

1. Romberg integration approximates the value of the integral


$$\int_0^1 \frac{1}{1+x^2}\,dx$$

with an error of $1.2113 \times 10^{-11}$ using only 33 function evaluations. How many
function evaluations would be needed to achieve the same level of accuracy using
the composite trapezoidal rule, the composite midpoint rule, the composite
Simpson's rule, and the composite two-point Gaussian quadrature rule?


2. Romberg integration approximates the value of the integral


1 
i e.” dz 
-1 


with an error of 4.2399 x 1074 using only 17 function evaluations. How many 
function evaluations would be needed to achieve the same level of accuracy us- 
ing the composite trapezoidal rule, the composite midpoint rule, the composite 
Simpson’s rule, and the composite two-point Gaussian quadrature rule? 


3. Romberg integration approximates the value of the integral


w 
| sing dx 
40 


with an, error of 1.3207 x 10~'* using only 33 function evaluations. How many 
function evaluations would be needed to achieve the same level of accuracy us- 
ing the composite trapezoidal rule, the composite midpoint rule, the composite 
Simpson's rule and the composite two-point Gaussian quadrature rule? 
In Exercises 4-7, the first column of the Romberg integration table for the specified
definite integral is provided. Complete the table and determine the absolute error in
the final approximation.

4. $\int_0^{3\pi/2} \cos x\,dx$
        2.3561944902
       -0.4879838567
       -0.8815735630
       -0.9709165361

5. $\int_0^2 e^x\,dx$
        8.3890560989
        6.9128098779
        6.5216101095
        6.4222978214

6. $\int_0^4 x\sqrt{x^2+9}\,dx$
       40.0000000000
       34.4222051019
       33.1013029795
       32.7750803748

7. $\int_1^3 \frac{1}{x}\,dx$
        1.3333333333
        1.1666666667
        1.1166666667
        1.1032106782

In Exercises 8-13, 
(a) Starting with only one subinterval, construct the four row Romberg integration 
table for the indicated integral. 
(b) What is the error estimate for the final approximation? How does this compare 
with the actual error? 
(c) How many subintervals would have been necessary to achieve the same accuracy 
using the composite trapezoidal rule without extrapolation? 


3.5 
8. Is Fea dx 


9. F xe” dz 
10. fy Vi+ ae da 
11. tin tan la dz 
12. us ive dx 
13. itl sinz_ gy 


0 licosz 


In Exercises 14-19, approximate the value of the indicated definite integral to within 
an absolute error tolerance of 5 x 107” using Romberg integration. How many function 
evaluations are needed? 


14, f? sine de 


T oq 
15. ie Visa dz 
16. i V1+2° dz 
17.


18 


19 


20. 


2k, 
22, 


23, 
f : sin(x”) dx

0 

J a a elactllp 
° 40 1+2' 


‘ f x tan! (24) dx 


Use the table generated in the “Tabulating the Error Function” application prob- 
lem and Hermite cubic interpolation to approximate the value of the error func- 
tion at the indicated value of x. How well does the value obtained in this manner 
compare to the actual value of the error function? 


(a) x = 0.799 (b) x = 1.265 
(c) x = 0.156 (a) c£=1.771 
(e) r= 0.301 (f) 2 = 1.545 


Show that, for any $k$, $R_{k,2}$ is the composite Simpson's rule with $h = (b-a)/2^{k-1}$.

The table below gives the volume $v$ (measured in cubic inches) and the pressure
$p$ (measured in pounds per square inch) of a gas as it expands.

    v   0.75  1.00  1.25  1.50  1.75  2.00  2.25  2.50  2.75
    p   89.8  68.7  55.0  45.8  39.3  34.4  30.5  27.5  26.0

Estimate the work done by the gas,

$$W = \int_{0.75}^{2.75} p\,dv,$$

as follows: Use the trapezoidal rule with $h = 2.0$, $h = 1.0$, $h = 0.5$, and $h = 0.25$,
and then extrapolate.


Consider the integral 


i: a 
exp(—-ICt) y[A(1 oa y) = Biny| : 


which arises in the projection printing of a photoresist film. Here, $M$ denotes
the normalized photoactive compound concentration present in the resist film
after exposure to light; $A$, $B$, and $C$ are material properties of the resist film;
and the product $It$ is the exposure energy of the light source used during the
printing phase. For the resist material AZ2400, $A = 0.162/\mu\text{m}$, $B = 0.184/\mu\text{m}$,
and $C = 0.0128\ \text{cm}^2/\text{mJ}$. Suppose the exposure energy is 110 mJ/cm$^2$.

(a) For the resist material AZ2400, evaluate the above integral for $M = 0.32$ to
five decimal places.


(b) Determine the value of M, correct to four decimal places, so that 


is ey - 
exp(-rct) YIA(I - y) — Bing] 
6.8 ADAPTIVE QUADRATURE


Consider approximating the value of

$$\int_0^1 x^{16} \cos(x^{16})\,dx$$

using Simpson's rule. An accuracy of six decimal places (i.e., a maximum absolute
error of $5 \times 10^{-7}$) is desired. Since for $f(x) = x^{16}\cos(x^{16})$

$$\max_{0 \le x \le 1} |f^{(4)}(x)| = 906753.5482,$$

$n = 318$ (that amounts to 319 function evaluations) would be required to guarantee
the desired accuracy. An examination of the graph of the integrand (Figure 6.11)
suggests partitioning the integration interval into, say, $[0, 0.7] \cup [0.7, 1]$ to isolate the
nearly constant left-hand portion of the integrand from the rapidly varying
right-hand portion. Dividing the allowable error equally between the two pieces, the
number of function evaluations required to guarantee a total absolute error of less
than $5 \times 10^{-7}$ from Simpson's rule would be reduced to 124: 40 evaluations for the
subinterval $[0, 0.7]$ and 84 more evaluations for the subinterval $[0.7, 1]$.


Figure 6.11 Graph of $x^{16}\cos(x^{16})$ over the interval [0, 1].


This example illustrates a general problem associated with composite Newton- 
Cotes formulas. If the step size is chosen small enough to resolve the most rapidly 
varying portion(s) of the integrand, then the use of equally spaced points implies 
that the step size will be smaller (possibly much smaller) than is necessary to 
resolve the slowly varying portion(s) of the integrand. We would therefore like to 




use nonuniformly spaced abscissas. These abscissas could be fixed ahead of time, 
as done above, by using a priori knowledge of the integrand. It would, however, 
be preferable to allow the points to be chosen automatically as the approximation 
evolves, making for a more robust algorithm that could be used on a wide variety 
of problems. Such algorithms are referred to as adaptive quadrature routines. As 
we will see below, the key to producing such schemes is the ability to estimate the 
error in the approximation as it is being computed. 


A Simple Adaptive Strategy 


Given a function $f$, an interval $[a, b]$ and an error tolerance $\epsilon$, the objective of an
adaptive quadrature scheme is to automatically select the abscissas in such a way
that the estimated error in the final approximation will be less than $\epsilon$. A simple
adaptive strategy is outlined in the following four-step algorithm.


1. Compute an approximation, $I_0$, to $\int_a^b f(x)\,dx$;

2. Split the interval $[a, b]$ into two pieces, $[a, c]$ and $[c, b]$, where $c = (a + b)/2$,
   and then compute $I_1 \approx \int_a^c f(x)\,dx$ and $I_2 \approx \int_c^b f(x)\,dx$;

3. Compare $I_1 + I_2$ with $I_0$ to estimate the error in $I_1 + I_2$; and

4. If |estimated error| $< \epsilon$, then accept $I_1 + I_2$ as an approximation to $\int_a^b f(x)\,dx$;
   otherwise
   apply the same procedure to $[a, c]$ and $[c, b]$, allowing each piece a tolerance
   of $\epsilon/2$.


In general, there is no requirement that the same basic quadrature scheme be
used to compute all three approximations $I_0$, $I_1$, and $I_2$. The key to implementing
this algorithm, however, is step 3: the ability to combine $I_0$, $I_1$, and $I_2$ into an
estimate of the error in the overall approximation as it is evolving.


Estimating Error on the Fly 


The specific mechanism by which the error is estimated in step 3 depends on the
basic quadrature scheme used to generate the approximations $I_0$, $I_1$, and $I_2$. Let's
suppose that each of these values was computed using Simpson's rule and that
$h = (b - a)/2$. Since Simpson's rule is a fourth-order method, provided $f$ is sufficiently
differentiable, we know that

$$\int_a^b f(x)\,dx = S(a,b) + k_1 h^4 + o(h^4),$$

where $k_1$ is a constant independent of $h$. Here, we have used $S(a, b)$ to denote
the Simpson's rule approximation computed over the interval $[a, b]$. Using similar
notation for the Simpson's rule approximations over $[a, c]$ and $[c, b]$, and having
selected $c$ to be the midpoint between $a$ and $b$, we also know that

$$\int_a^b f(x)\,dx = S(a,c) + S(c,b) + k_1 \left(\frac{h}{2}\right)^4 + o(h^4).$$

Subtracting these last two expressions and solving for the leading term in the error
of $S(a,c) + S(c,b)$, we find

$$k_1 \left(\frac{h}{2}\right)^4 = \frac{1}{15}\left[S(a,c) + S(c,b) - S(a,b)\right] + o(h^4).$$

If we assume that the $o(h^4)$ term is in fact negligible, then we will accept

$$\frac{1}{15}\left[S(a,c) + S(c,b) - S(a,b)\right]$$

as an estimate of the error in $S(a,c) + S(c,b)$. Therefore, we can test

$$|S(a,c) + S(c,b) - S(a,b)| < 15\epsilon$$

in step 3 of our basic algorithm. Since we have made several approximations and
assumptions along the way, in practice, we would be conservative and use a number
smaller than 15 in step 3, say 10.

It is important to note that the only characteristic of Simpson's rule that is
used in obtaining the above error estimate is the fact that the rule is fourth order.
It therefore follows that if $Q(a,b)$ denotes any fourth-order quadrature rule (such
as the two-point Gaussian quadrature rule or the open Newton-Cotes formula with
$n = 2$) applied over the integration interval $[a,b]$ and $c$ denotes the midpoint of
that interval, then

$$\frac{1}{15}\left[Q(a,c) + Q(c,b) - Q(a,b)\right]$$

represents an estimate for the error in the approximation $Q(a,c) + Q(c,b)$.

This technique for obtaining error estimates is easily extended to methods of
other orders. In general, if $Q(a, b)$ is a method of order $p$ (that is, the leading term
in the error is $O(h^p)$), then

$$\frac{1}{2^p - 1}\left[Q(a,c) + Q(c,b) - Q(a,b)\right]$$

is an estimate for the error in the approximation $Q(a,c) + Q(c,b)$. As discussed
above, when putting this error estimate into practice, a number smaller than $2^p - 1$
should be used to be conservative.
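
The general divisor $2^p - 1$ follows from the same subtraction used for Simpson's rule. The short derivation below is a sketch under the assumption that both error expansions share a single leading constant, written here as $k$ (a symbol introduced only for this illustration).

```latex
\begin{align*}
\int_a^b f(x)\,dx &= Q(a,b) + k h^p + o(h^p), \\
\int_a^b f(x)\,dx &= Q(a,c) + Q(c,b) + k\left(\frac{h}{2}\right)^p + o(h^p). \\
\intertext{Subtracting the second line from the first and using $h^p = 2^p (h/2)^p$:}
Q(a,c) + Q(c,b) - Q(a,b) &= (2^p - 1)\, k\left(\frac{h}{2}\right)^p + o(h^p), \\
k\left(\frac{h}{2}\right)^p &= \frac{Q(a,c) + Q(c,b) - Q(a,b)}{2^p - 1} + o(h^p).
\end{align*}
```

Setting $p = 4$ recovers the divisor 15 used for Simpson's rule, and $p = 6$ gives 63.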


Adaptive Quadrature in Action 


Recall the problem posed at the beginning of this section: Approximate the value
of the integral

$$\int_0^1 x^{16} \cos(x^{16})\,dx$$

to six decimal places of accuracy. Let's use Simpson's rule as the underlying
quadrature rule and continue to use $S(a,b)$ to denote the Simpson's rule approximation
computed over the interval $[a, b]$.

Our first step is to calculate

$$S(0,1) = \frac{1}{6}\left[f(0) + 4f\left(\tfrac{1}{2}\right) + f(1)\right] = 0.09006055683740.$$

Next, we cut the integration interval in half and calculate

$$S\left(0,\tfrac{1}{2}\right) = \frac{1}{12}\left[f(0) + 4f\left(\tfrac{1}{4}\right) + f\left(\tfrac{1}{2}\right)\right] = 0.00000127164336; \text{ and}$$

$$S\left(\tfrac{1}{2},1\right) = \frac{1}{12}\left[f\left(\tfrac{1}{2}\right) + 4f\left(\tfrac{3}{4}\right) + f(1)\right] = 0.04836716117637.$$

Our estimate for the error in $S(0,\tfrac{1}{2}) + S(\tfrac{1}{2},1)$ is therefore

$$\frac{1}{10}\left|S\left(0,\tfrac{1}{2}\right) + S\left(\tfrac{1}{2},1\right) - S(0,1)\right| = 4.169 \times 10^{-3}.$$

Since this error estimate is larger than the specified tolerance of $5 \times 10^{-7}$, we repeat
the entire process on the subintervals $[0, \tfrac{1}{2}]$ and $[\tfrac{1}{2}, 1]$, allowing each subinterval an
error of $2.5 \times 10^{-7}$.
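
The first step of this calculation can be checked numerically. The sketch below is only an illustration: the helper names `simpson` and `split_error_estimate` are hypothetical, and the divisor 10 is the conservative safety factor adopted in the text.

```python
import math

def f(x):
    """Integrand of the running example: x^16 * cos(x^16)."""
    return x**16 * math.cos(x**16)

def simpson(f, a, b):
    """Basic three-point Simpson's rule S(a, b)."""
    return (b - a) * (f(a) + 4.0 * f((a + b) / 2.0) + f(b)) / 6.0

def split_error_estimate(f, a, b, safety=10.0):
    """Return S(a,c) + S(c,b) and |S(a,c) + S(c,b) - S(a,b)| / safety."""
    c = (a + b) / 2.0
    whole = simpson(f, a, b)
    halves = simpson(f, a, c) + simpson(f, c, b)
    return halves, abs(halves - whole) / safety

if __name__ == "__main__":
    print(f"S(0,1)   = {simpson(f, 0.0, 1.0):.14f}")
    print(f"S(0,1/2) = {simpson(f, 0.0, 0.5):.14f}")
    print(f"S(1/2,1) = {simpson(f, 0.5, 1.0):.14f}")
    _, estimate = split_error_estimate(f, 0.0, 1.0)
    print(f"error estimate = {estimate:.3e}")
```

Calling `split_error_estimate` on $[0, \tfrac{1}{2}]$ and $[\tfrac{1}{2}, 1]$ reproduces the next level of the calculation in the same way.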

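On $[0, \tfrac{1}{2}]$, we already know that $S(0, \tfrac{1}{2}) = 0.00000127164336$. Cutting the
interval in half, we then calculate

$$S\left(0,\tfrac{1}{4}\right) = \frac{1}{24}\left[f(0) + 4f\left(\tfrac{1}{8}\right) + f\left(\tfrac{1}{4}\right)\right] = 0.00000000000970; \text{ and}$$

$$S\left(\tfrac{1}{4},\tfrac{1}{2}\right) = \frac{1}{24}\left[f\left(\tfrac{1}{4}\right) + 4f\left(\tfrac{3}{8}\right) + f\left(\tfrac{1}{2}\right)\right] = 0.00000066128135.$$

Accordingly, we estimate that the error in $S(0, \tfrac{1}{4}) + S(\tfrac{1}{4}, \tfrac{1}{2})$ is

$$\frac{1}{10}\left|S\left(0,\tfrac{1}{4}\right) + S\left(\tfrac{1}{4},\tfrac{1}{2}\right) - S\left(0,\tfrac{1}{2}\right)\right| = 6.103 \times 10^{-8},$$

which is smaller than $2.5 \times 10^{-7}$. Hence, we are finished with the subinterval $[0, \tfrac{1}{2}]$
and have found that

$$\int_0^{1/2} x^{16} \cos(x^{16})\,dx \approx 0.00000000000970 + 0.00000066128135 = 0.00000066129105$$

with an estimated error of $6.103 \times 10^{-8}$.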
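Moving on to $[\tfrac{1}{2}, 1]$, we already know that $S(\tfrac{1}{2}, 1) = 0.04836716117637$. Next,
we calculate

$$S\left(\tfrac{1}{2},\tfrac{3}{4}\right) = \frac{1}{24}\left[f\left(\tfrac{1}{2}\right) + 4f\left(\tfrac{5}{8}\right) + f\left(\tfrac{3}{4}\right)\right] = 0.00050857313250,$$

$$S\left(\tfrac{3}{4},1\right) = \frac{1}{24}\left[f\left(\tfrac{3}{4}\right) + 4f\left(\tfrac{7}{8}\right) + f(1)\right] = 0.04247103735072,$$

and the error estimate

$$\frac{1}{10}\left|S\left(\tfrac{1}{2},\tfrac{3}{4}\right) + S\left(\tfrac{3}{4},1\right) - S\left(\tfrac{1}{2},1\right)\right| = 5.388 \times 10^{-4}.$$

Since this error estimate is larger than $2.5 \times 10^{-7}$, we must further subdivide $[\tfrac{1}{2}, 1]$
into $[\tfrac{1}{2}, \tfrac{3}{4}]$ and $[\tfrac{3}{4}, 1]$. Each of these new subintervals is allotted an error of
$1.25 \times 10^{-7}$.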

If we continue to work in this fashion, we eventually find

$$\int_0^1 x^{16} \cos(x^{16})\,dx \approx 0.049121999503,$$

with an error estimate of $1.6873 \times 10^{-7}$. To thirteen decimal places, the value of this
integral is 0.0491217295177. The actual error in the approximate value is therefore
$2.6999 \times 10^{-7}$. Thus, the reported error estimate is slightly low, but of the correct
order of magnitude, and the overall error is well within the requested accuracy.

Figure 6.12 displays the 73 non-uniformly spaced points, represented by cir- 
cles, at which the integrand was sampled to arrive at the above approximation. 
Compare this with the 319 uniformly spaced abscissas which would have been 
needed to guarantee similar accuracy. The asterisks along the top of the figure 
display the locations of the abscissas and have been included to more clearly demon- 
strate the varying density of the abscissas within the integration interval. At the 
left end of the interval, where the integrand is relatively flat, few quadrature points 
are selected. As we move to the right, the integrand varies more widely, and the 
density of quadrature points increases accordingly. 


An Efficient Implementation 


To achieve the efficiency indicated by Figure 6.12 (i.e., using only 73 function eval- 
uations to obtain the final approximation), we need to be careful how we construct 
our algorithm. A naive translation of the basic four-step strategy produces 


function adapt ( f, a, b, ε )
    set sab = S(a, b)
    set c = (a + b)/2
    set sac = S(a, c)
    set scb = S(c, b)
    if ( |sac + scb − sab|/10 < ε )
        return ( sac + scb )
    else
        return ( adapt ( f, a, c, ε/2 ) + adapt ( f, c, b, ε/2 ) )


This algorithm, however, is very wasteful of computational effort. For instance,
suppose we approximate $\int_0^1 x^{16} \cos(x^{16})\,dx$ to an accuracy of six decimal places.
Though the integrand is evaluated at only those 73 points shown in Figure 6.12,
the integrand is evaluated a total of 315 times.
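
The waste is easy to observe by instrumenting the naive translation. The following Python sketch mirrors the pseudocode line for line; the counting wrapper `make_counted` and the name `adapt_naive` are assumptions made for this illustration. Every call performs three Simpson's rule approximations of three evaluations each, so the total is always a multiple of nine.

```python
import math

def make_counted(f):
    """Wrap f so that each evaluation is tallied in counter['n'] (illustrative helper)."""
    counter = {"n": 0}
    def counted(x):
        counter["n"] += 1
        return f(x)
    return counted, counter

def simpson(f, a, b):
    """Basic three-point Simpson's rule; costs three evaluations of f."""
    return (b - a) * (f(a) + 4.0 * f((a + b) / 2.0) + f(b)) / 6.0

def adapt_naive(f, a, b, eps):
    """Direct translation of the four-step strategy, with no reuse of values."""
    sab = simpson(f, a, b)
    c = (a + b) / 2.0
    sac = simpson(f, a, c)
    scb = simpson(f, c, b)
    if abs(sac + scb - sab) / 10.0 < eps:
        return sac + scb
    return adapt_naive(f, a, c, eps / 2.0) + adapt_naive(f, c, b, eps / 2.0)

def integrand(x):
    return x**16 * math.cos(x**16)

if __name__ == "__main__":
    g, counter = make_counted(integrand)
    value = adapt_naive(g, 0.0, 1.0, 5e-7)
    print(f"approximation = {value:.12f}")
    print(f"function evaluations = {counter['n']}")
```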
Figure 6.12 Circles represent the 73 function values sampled from the
integrand $x^{16}\cos(x^{16})$ in calculating an estimate for the integral accurate
to six decimal places. Asterisks denote the locations of the quadrature
abscissas to demonstrate the variable density within the integration interval.


Why is this algorithm so inefficient? First, with each recursive call to adapt,
one Simpson's rule approximation is entirely recalculated. That's three unnecessary
function evaluations per recursive call. Second, the algorithm continually
recalculates function values used to construct different Simpson's rule approximations. For
example, $f(0)$ is calculated as part of $S(0,1)$, $S(0,\tfrac{1}{2})$, and $S(0,\tfrac{1}{4})$, while $f(\tfrac{1}{2})$ is
calculated as part of

$$S(0,1),\ S\left(0,\tfrac{1}{2}\right),\ S\left(\tfrac{1}{2},1\right),\ S\left(\tfrac{1}{4},\tfrac{1}{2}\right),\ S\left(\tfrac{1}{2},\tfrac{3}{4}\right),\ S\left(\tfrac{1}{2},\tfrac{5}{8}\right),\ \text{and}\ S\left(\tfrac{1}{2},\tfrac{9}{16}\right).$$

To improve efficiency, we have to find some way to save Simpson's rule
approximations and function values for later use.

An algorithm which achieves this objective is given below. Note that this
algorithm is separated into two parts: an initialization component, adapt, and a
recursive component, adapt1. The initialization component simply calculates the
first three function values and the first Simpson's rule approximation. This
information is then passed along to the recursive component. Each call to the recursive
component carries out the second, third, and fourth steps of the basic strategy
for a particular portion of the integration interval. By passing the Simpson's rule
approximation and the function values needed to process each subinterval through
the argument list, this algorithm calculates each Simpson's rule approximation and
each function value only once.
function adapt ( f, a, b, ε )
    set fa = f(a)
    set fc = f((a + b)/2)
    set fb = f(b)
    set sab = (b − a) * (fa + 4 * fc + fb)/6
    return ( adapt1 ( sab, fa, fc, fb, f, a, b, ε ) )


function adapt1 ( sab, fa, fc, fb, f, a, b, ε )
    set c = (a + b)/2
    set fd = f((a + c)/2)
    set fe = f((c + b)/2)
    set sac = (c − a) * (fa + 4 * fd + fc)/6
    set scb = (b − c) * (fc + 4 * fe + fb)/6
    if ( |sac + scb − sab|/10 < ε )
        return ( sac + scb )
    else
        return ( adapt1 ( sac, fa, fd, fc, f, a, c, ε/2 ) +
                 adapt1 ( scb, fc, fe, fb, f, c, b, ε/2 ) )
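
For readers who want to experiment, here is one possible Python rendering of the two-part algorithm. It is a sketch that follows the pseudocode directly; the leading underscore in `_adapt1` and the list-based evaluation log in the usage example are choices made only for this illustration.

```python
import math

def adapt(f, a, b, eps):
    """Initialization: three function values and the first Simpson approximation."""
    fa, fc, fb = f(a), f((a + b) / 2.0), f(b)
    sab = (b - a) * (fa + 4.0 * fc + fb) / 6.0
    return _adapt1(sab, fa, fc, fb, f, a, b, eps)

def _adapt1(sab, fa, fc, fb, f, a, b, eps):
    """Recursive step: only two new function values are needed per call."""
    c = (a + b) / 2.0
    fd = f((a + c) / 2.0)   # midpoint of the left half
    fe = f((c + b) / 2.0)   # midpoint of the right half
    sac = (c - a) * (fa + 4.0 * fd + fc) / 6.0
    scb = (b - c) * (fc + 4.0 * fe + fb) / 6.0
    if abs(sac + scb - sab) / 10.0 < eps:
        return sac + scb
    # Pass the half-interval approximations and values down for reuse.
    return (_adapt1(sac, fa, fd, fc, f, a, c, eps / 2.0)
            + _adapt1(scb, fc, fe, fb, f, c, b, eps / 2.0))

def integrand(x):
    return x**16 * math.cos(x**16)

if __name__ == "__main__":
    samples = []
    def logged(x):
        samples.append(x)          # record every abscissa that is sampled
        return integrand(x)
    value = adapt(logged, 0.0, 1.0, 5e-7)
    print(f"approximation = {value:.12f}")
    print(f"function evaluations = {len(samples)}")
    print(f"distinct abscissas   = {len(set(samples))}")
```

Because each recursive call adds exactly two evaluations to the initial three, the evaluation count is always odd and equals the number of distinct abscissas.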


A Second Strategy 
An alternative to the adaptive strategy presented above is the four-step algorithm:


1. Compute an approximation, $I_0$, to $\int_a^b f(x)\,dx$;

2. Compute a second approximation, $I_1$, using a method of higher order than
   the one used to compute $I_0$;

3. Use $I_1 - I_0$ as an estimate of the error in $I_1$; and

4. If |estimated error| $< \epsilon$, then accept $I_1$ as an approximation to $\int_a^b f(x)\,dx$;
   otherwise
   apply the same procedure to $[a, c]$ and $[c, b]$, allowing each piece a tolerance
   of $\epsilon/2$, where $c = (a + b)/2$.


The error estimate in step 3 is justified as follows. Suppose the approximation $I_0$
is computed using a quadrature rule that is $O(h^p)$, while $I_1$ is computed using a
rule that is $O(h^q)$, with $q > p$. Then

$$\int_a^b f(x)\,dx = I_0 + c_0 h^p + o(h^p)$$

and

$$\int_a^b f(x)\,dx = I_1 + c_1 h^q + o(h^q),$$

where $c_0$ and $c_1$ are constants independent of $h$. Subtracting these two equations
gives

$$0 = I_0 - I_1 + c_0 h^p - c_1 h^q + o(h^p) - o(h^q).$$

Assuming all terms $o(h^p)$ and smaller can be neglected, it follows that

$$c_0 h^p \approx I_1 - I_0;$$

hence, the leading term in the error in $I_0$ is approximately $I_1 - I_0$. Since $I_1$ is
a more accurate approximation than $I_0$, it follows that $I_1 - I_0$ is a conservative
estimate for the error in $I_1$.
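
One way to realize this second strategy is to pair Simpson's rule with Boole's rule on the same five equally spaced points, so that every function value is shared. The Python sketch below is an assumption-laden illustration rather than the implementation behind the results quoted in the text: the names `adapt_simpson_boole` and `_refine`, and the particular scheme for reusing values, are hypothetical.

```python
import math

def adapt_simpson_boole(f, a, b, eps):
    """Second strategy: Simpson (lower order) versus Boole (higher order)."""
    h = (b - a) / 4.0
    values = [f(a + i * h) for i in range(5)]   # five equally spaced samples
    return _refine(values, f, a, b, eps)

def _refine(values, f, a, b, eps):
    f0, f1, f2, f3, f4 = values
    simpson = (b - a) * (f0 + 4.0 * f2 + f4) / 6.0
    boole = (b - a) * (7.0 * f0 + 32.0 * f1 + 12.0 * f2 + 32.0 * f3 + 7.0 * f4) / 90.0
    # |I1 - I0| serves as the (conservative) estimate of the error in I1.
    if abs(boole - simpson) < eps:
        return boole
    c = (a + b) / 2.0
    w = (b - a) / 8.0
    # Each half reuses three existing values and needs two new ones.
    left = [f0, f(a + w), f1, f(a + 3.0 * w), f2]
    right = [f2, f(a + 5.0 * w), f3, f(a + 7.0 * w), f4]
    return (_refine(left, f, a, c, eps / 2.0)
            + _refine(right, f, c, b, eps / 2.0))

def integrand(x):
    return x**16 * math.cos(x**16)

if __name__ == "__main__":
    count = [0]
    def counted(x):
        count[0] += 1
        return integrand(x)
    value = adapt_simpson_boole(counted, 0.0, 1.0, 5e-7)
    print(f"approximation = {value:.12f}")
    print(f"function evaluations = {count[0]}")
```

In this arrangement each subdivision costs four new evaluations on top of the initial five.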

Applying this second adaptive strategy, with Simpson's rule as the lower-order
method, Boole's rule as the higher-order method and $\epsilon = 5 \times 10^{-7}$, we find

$$\int_0^1 x^{16} \cos(x^{16})\,dx \approx 0.049121730001,$$

with an error estimate of $1.0920 \times 10^{-7}$. The actual error in this approximation is
$4.8378 \times 10^{-10}$, which is quite a bit smaller than the reported error estimate.

By constructing the algorithm to reuse function values whenever possible, the
above approximation was obtained with 141 function evaluations. Though this
second approach needed nearly twice as many function evaluations as the first adaptive
scheme to achieve six decimal places of accuracy, this scheme still used fewer than
half the number of function evaluations required by the composite Simpson's rule.


Two Comparison Problems 


The following examples compare the performance of the adaptive strategies
described in this section applied to several different basic quadrature rules. Each
scheme based upon a fourth-order method (the adaptive Simpson's rule and the
adaptive two-point Gaussian quadrature rule) uses

$$[Q(a,c) + Q(c,b) - Q(a,b)]/10$$

as a conservative estimate of the error in $Q(a,c) + Q(c,b)$. The schemes based
upon sixth-order methods (the adaptive Boole's rule and the adaptive three-point
Gaussian quadrature rule) use

$$[Q(a,c) + Q(c,b) - Q(a,b)]/42$$

as an error estimate. A brief explanation is warranted for the choice of constant in
this error estimate. Theory indicates that division by 15 should be used in the error
estimate for fourth-order methods. We cut one-third off of this value and used 10 to
be conservative. For sixth-order methods, theory indicates that the error estimate
should involve division by 63; cutting one-third off this value leaves 42.


EXAMPLE 6.19 Comparison of Adaptive Quadrature Rules 1 


Consider the integral

$$I = \int_0^2 e^x \sin(x^2 \cos e^x)\,dx \approx -1.11595799093275.$$

Figure 6.13 Graph of $e^x \sin(x^2 \cos e^x)$ on [0, 2].

The graph of the integrand is shown in Figure 6.13. This integrand is similar to the
one considered earlier in this section in that the left portion is quite flat and the
amount of variation increases as we move to the right. Adaptive quadrature should
therefore prove very helpful in efficiently approximating the value of this integral.
The following table lists the number of function evaluations needed to
approximate $I$ to five decimal places of accuracy ($\epsilon = 5 \times 10^{-6}$) and to ten decimal
places of accuracy ($\epsilon = 5 \times 10^{-11}$). To save space, we have abbreviated "two-point
Gaussian quadrature" as GQ2 and "three-point Gaussian quadrature" as GQ3.

                                       Accuracy
                           5 decimal places   10 decimal places
Adaptive Simpson                 217                3697
Adaptive GQ2                     414                7054
Adaptive Simpson/Boole           401                7029
Adaptive Boole                   113                 689
Adaptive GQ3                     137                 857
Romberg Integration              129                1025


Some of the function evaluation counts given above (especially those
associated with the fourth-order basic quadrature rules) may seem rather high. To
place these counts into perspective, let's determine the number of function
evaluations which would be required by the corresponding composite quadrature rules to
achieve each level of accuracy. For $f(x) = e^x \sin(x^2 \cos e^x)$,

$$\max_{0 \le x \le 2} |f^{(4)}(x)| = 2.5823 \times 10^7.$$

Substituting this value into the error term for the composite Simpson's rule leads to
an estimate of 981 function evaluations to achieve five decimal places of accuracy
and 17408 to achieve ten decimal places of accuracy. The function evaluation
counts for the composite two-point Gaussian quadrature rule to achieve five and
ten decimal places of accuracy are 886 and 15730, respectively. Comparing these
counts to those listed above, we see that the adaptive Simpson's rule reduces the
work effort by better than a factor of 4, while the adaptive two-point Gaussian
quadrature rule and the adaptive Simpson/Boole rule reduce the work effort by
better than a factor of 2.


EXAMPLE 6.20 Comparison of Adaptive Quadrature Rules 2 


As another test problem, consider the integral

$$I = \int_0^5 \frac{50}{\pi(1 + 2500x^2)}\,dx = \frac{1}{\pi}\tan^{-1} 250 \approx 0.49872676724581.$$

The table below lists the number of function evaluations needed to approximate $I$
to five decimal places of accuracy ($\epsilon = 5 \times 10^{-6}$) and to ten decimal places of
accuracy ($\epsilon = 5 \times 10^{-11}$).


                                       Accuracy
                           5 decimal places   10 decimal places
Adaptive Simpson                 177                3161
Adaptive GQ2                     326                5630
Adaptive Simpson/Boole           309                5629
Adaptive Boole                   105                 633
Adaptive GQ3                     127                 787
Romberg Integration             1025                8193

Comparing the function evaluation counts for the adaptive schemes with those 
required by the corresponding composite routines is left as an exercise. 


It is interesting to note that the adaptive Gaussian quadrature routines (both 
two-point and three-point) underperform the routines based on Newton-Cotes quad- 
rature rules of similar order. In Section 6.6, we found that Gaussian quadrature 
rules have smaller coefficients on their error terms, and therefore, generally require 
fewer function evaluations than Newton-Cotes rules when used in a composite man- 
ner. When applied in an adaptive manner, however, Newton-Cotes rules have the 
advantage of being able to reuse function values. With Gaussian quadrature rules, 
the only time that function values can be reused is when we implement the second 
adaptive strategy and choose rules that both evaluate the integrand at the midpoint 
of the integration interval. Even then, the only function value that can be reused 
is that value at the midpoint of the integration interval. 
The underperformance of the adaptive two-point and three-point Gaussian
quadrature rules should not be taken to imply that all adaptive Gaussian quadrature
rules underperform their Newton-Cotes counterparts. Note that while the
adaptive Simpson's rule uses fewer than 60% of the number of function evaluations
of the two-point Gaussian quadrature scheme, the adaptive Boole's rule uses more
than 80% of the number of evaluations of the three-point scheme. Eventually, adaptive
Gaussian quadrature surpasses adaptive Newton-Cotes quadrature.

We make one final comment regarding adaptive Gaussian quadrature. Though
different Gaussian quadrature rules do not share abscissas (with the possible exception
of the integration interval midpoint), it is possible to construct higher-order
quadrature rules which use the abscissas of a given Gaussian quadrature rule. These 
rules can be paired with the Gaussian quadrature rule to form efficient adaptive 
schemes based on our second basic strategy. For example, Kronrod [1] showed how 
to supplement the abscissas of any n-point Gaussian quadrature rule with n + 1 
new abscissas so as to construct a formula with maximum degree of precision. The 
resulting (2n + 1)-point Kronrod rule has degree of precision 3n +1 when n is even 
and 3n + 2 when n is odd. Patterson [2] built upon Kronrod’s idea by adding an 
additional 2n + 2 abscissas to obtain a quadrature rule with degree of precision 
6n + 4. 


Application Problem: Flow between Parallel Plates 


A classic problem in fluid mechanics is determining the flow induced in a viscous 
fluid filling the gap between two large parallel plates, as shown in Figure 6.14(a). 
The lower plate is fixed and the upper plate moves steadily with velocity $U_0$. The
distance between the plates is $h$, and no-slip conditions are assumed at both plates.
This means that the fluid velocity is zero at the lower plate and $U_0$ at the upper
plate.




Figure 6.14 (a) Viscous fluid filling the gap between two parallel 
plates, one fixed, the other moving with constant velocity. (b) Force 
balance on a representative fluid element. 
Let $U(Y)$ denote the steady-state velocity distribution established in the fluid
as a result of the shearing motion of the upper plate. To determine $U(Y)$, we
first consider the representative fluid element shown in Figure 6.14(b). Assuming
no pressure variation in the flow direction and that gravitational forces may be
neglected, the only force acting on the fluid element is the shear stress, $\tau$, acting
along the top and bottom surfaces. Summing forces in the $X$-direction yields the
equation

$$\left(\tau + \frac{d\tau}{dY}\Delta Y\right)\Delta X\,\Delta Z - \tau\,\Delta X\,\Delta Z = 0,$$

or, after simplification,

$$\frac{d\tau}{dY} = 0. \qquad (1)$$

If the fluid is newtonian, then the shear stress is related to the velocity by the
law

$$\tau = \mu \frac{dU}{dY}, \qquad (2)$$

where $\mu$ is a property of the fluid called the coefficient of viscosity. In the simplest
case, we would assume that $\mu$ was constant; however, to add a little twist, let's
suppose that the two plates are maintained at different temperatures and a linear
temperature gradient exists within the fluid. Now, temperature has a strong effect
on viscosity. Hence, assuming that temperature is a function of $Y$ implies that $\mu$
would also be a function of $Y$. Using this assumption in (2) and then substituting
into (1), we find that $U(Y)$ satisfies the differential equation

$$\frac{d}{dY}\left(\mu(Y)\frac{dU}{dY}\right) = 0. \qquad (3)$$

Additionally, we have the boundary conditions $U(0) = 0$ and $U(h) = U_0$.
Integrating (3) twice with respect to $Y$, it follows that

$$U(Y) = c_1 \int_0^Y \frac{d\xi}{\mu(\xi)} + c_2.$$

The boundary condition $U(0) = 0$ requires that $c_2 = 0$. Next, the condition at
$Y = h$ requires that

$$c_1 = \frac{U_0}{\int_0^h d\xi/\mu(\xi)}.$$

Therefore,

$$U(Y) = U_0\,\frac{\int_0^Y d\xi/\mu(\xi)}{\int_0^h d\xi/\mu(\xi)}. \qquad (4)$$

Let's suppose the fluid is water, the bottom plate is maintained at a
temperature of 20°C and the top plate is maintained at 100°C. The following table lists
the viscosity of water as a function of temperature.


T (°C)                 20     30     40     50     60     70     80     90     100
μ (×10^{-3} N·s/m²)   1.003  0.799  0.657  0.548  0.467  0.405  0.355  0.316  0.283


According to White [3], the best fit to this data is the empirical result that $\ln \mu$
is quadratic in $1/T$, when $T$ is measured in the Kelvin scale. Temperature in the
Kelvin scale is obtained from temperature in the Celsius scale by adding 273.16.
After performing the indicated data transformations, regression yields

$$\mu(T) = \exp\left(-8.944 - \frac{839.456}{T} + \frac{421194.298}{T^2}\right). \qquad (5)$$

With the assumption of a linear temperature gradient between the two plates, it
follows that

$$T(Y) = \left[20 + 80\frac{Y}{h}\right]\,^{\circ}\text{C} = \left[293.16 + 80\frac{Y}{h}\right]\,\text{K}. \qquad (6)$$

Substituting (5) and (6) into (4) and then introducing the dimensionless variables
$u = U/U_0$ and $y = Y/h$, the expression for the velocity becomes

$$u(y) = \frac{\int_0^y \exp\left(8.944 + \frac{839.456}{293.16 + 80\xi} - \frac{421194.298}{(293.16 + 80\xi)^2}\right) d\xi}{\int_0^1 \exp\left(8.944 + \frac{839.456}{293.16 + 80\xi} - \frac{421194.298}{(293.16 + 80\xi)^2}\right) d\xi}. \qquad (7)$$

Figure 6.15 displays the nondimensional velocity distribution obtained from
equation (7). To produce this graph, values of $u$ were calculated at $y_i = 0.01i$ for
$i = 0, 1, 2, \ldots, 100$. All integrals were evaluated using adaptive three-point Gaussian
quadrature with $\epsilon = 5 \times 10^{-7}$. The independent variable has been plotted along
the vertical axis to match the geometry depicted in Figure 6.14(a).
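
The profile in equation (7) can be tabulated with any of the adaptive schemes from this section. The sketch below substitutes the adaptive Simpson's rule for the three-point Gaussian rule used in the text and accumulates the integral one grid cell at a time, allotting each cell a tolerance of $\epsilon/n$; both are simplifying assumptions for this illustration, as are the function names and the recursion-depth guard.

```python
import math

def inv_viscosity(xi):
    """Integrand of equation (7): 1/mu at dimensionless height xi."""
    T = 293.16 + 80.0 * xi                      # temperature in kelvins, eq. (6)
    return math.exp(8.944 + 839.456 / T - 421194.298 / T**2)

def adaptive_simpson(f, a, b, eps, max_depth=40):
    """Adaptive Simpson's rule that reuses function values (depth guard added)."""
    fa, fc, fb = f(a), f((a + b) / 2.0), f(b)
    sab = (b - a) * (fa + 4.0 * fc + fb) / 6.0
    return _refine(sab, fa, fc, fb, f, a, b, eps, max_depth)

def _refine(sab, fa, fc, fb, f, a, b, eps, depth):
    c = (a + b) / 2.0
    fd, fe = f((a + c) / 2.0), f((c + b) / 2.0)
    sac = (c - a) * (fa + 4.0 * fd + fc) / 6.0
    scb = (b - c) * (fc + 4.0 * fe + fb) / 6.0
    if abs(sac + scb - sab) / 10.0 < eps or depth == 0:
        return sac + scb
    return (_refine(sac, fa, fd, fc, f, a, c, eps / 2.0, depth - 1)
            + _refine(scb, fc, fe, fb, f, c, b, eps / 2.0, depth - 1))

def velocity_profile(n=100, eps=5e-7):
    """Return y_i = i/n and u(y_i), i = 0..n, from equation (7)."""
    ys = [i / n for i in range(n + 1)]
    cumulative = [0.0]
    for y0, y1 in zip(ys, ys[1:]):
        piece = adaptive_simpson(inv_viscosity, y0, y1, eps / n)
        cumulative.append(cumulative[-1] + piece)
    total = cumulative[-1]                      # denominator of equation (7)
    return ys, [value / total for value in cumulative]

if __name__ == "__main__":
    ys, us = velocity_profile()
    for i in range(0, 101, 10):
        print(f"y = {ys[i]:.2f}   u = {us[i]:.6f}")
```

Because the viscosity falls as the temperature rises toward the upper plate, the integrand grows with $y$, and the computed profile lies below the straight line $u = y$ that a constant viscosity would produce.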


References 


1. A.S. Kronrod, Nodes and Weights of Quadrature Formulas, Consultants Bureau, 
New York, 1965. 

2. T.N.L. Patterson, “The Optimum Addition of Points to Quadrature Formulae,” 
Mathematics of Computation, 22, 847-856, 1968. 

3. F. M. White, Fluid Mechanics, 2nd edition, McGraw-Hill, New York, 1986. 


EXERCISES 


1. For each of the following integrals, compute $S(a,b)$, $S(a,c)$, and $S(c,b)$, where
$c = (a+b)/2$. Compute the estimate for the error in $S(a,c) + S(c,b)$ and compare
this to the actual error in $S(a,c) + S(c,b)$.


(a) foe? dx (b) fi ide 
(c) fava? +9de (a) fo tan“ adx 

2. Repeat Exercise 1 using Boole’s rule (the closed Newton-Cotes formula with 
n= 4). 


3. Repeat Exercise 1 using the two-point Gaussian quadrature rule. 
Figure 6.15 Nondimensional velocity distribution in viscous fluid filling
gap between two large parallel plates.


4. Repeat Exercise 1 using the three-point Gaussian quadrature rule. 


5. For each of the integrals in Exercise 1, compute the Simpson's rule approximation
and the Boole's rule approximation. Confirm that the difference between these
two values approximates the error in the Simpson's rule value.


6. For each of the integrals in Exercise 1, compute the two-point Gaussian quadrature
rule approximation and the three-point Gaussian quadrature rule approximation.
Confirm that the difference between these two values approximates the
error in the two-point Gaussian quadrature rule value.


7. Determine the number of function evaluations which would be needed to
guarantee an accuracy of 10 decimal places in the approximation to the value of

$$\int_0^5 \frac{50}{\pi(1 + 2500x^2)}\,dx$$

using the composite Simpson's rule and the composite two-point Gaussian
quadrature rule. Compare with the number of function evaluations required by the
corresponding adaptive routines listed in the second example above.


In Exercises 8-16, approximate the value of the given integral to six (6) and to ten
(10) decimal places using the adaptive quadrature scheme of your choice. Compare
the number of function evaluations used to the number that would be required by the
corresponding composite quadrature rule to achieve the same accuracy.


8. 


ce en*" dx 


9, 


10. 
11. 
12. 


13. 


14. 
15. 


16. 
17. 


18. 


So Fees 
sin(x? cos e~*) dx


{ V14+ a4 dz 


fy cared 


fo” 


1 
Jo Tre 


Be 25% dr 


dz 


Jo cos(cosa + 3sinz + 2cos(2x) + 3.cos(3x) + 3 sin(2x)) dx 


{a) 


(b) 


(a) 


(b) 


Evaluate the integral 


i sin(./mz) dz 


0 
to six decimal places of accuracy using the adaptive Simpson’s rule. How 
many function evaluations were needed? 
Make the change of variable $u^2 = \pi x$ in the integral from part (a) and
reevaluate using the adaptive Simpson’s rule. How does the number of
function evaluations compare with the number from part (a)?


Evaluate the integral 


$$\int_0^1 \frac{2}{2 + \sin(10\pi x)}\,dx$$


to ten decimal places of accuracy using the adaptive Simpson’s rule. How 
many function evaluations were needed? 

Recognizing that the integrand in part (a) is periodic with period 1/5, re- 
compute the value from part (a) as 


$$5\int_0^{0.2} \frac{2}{2 + \sin(10\pi x)}\,dx$$


using the adaptive Simpson's rule. How does the number of function evalu- 
ations compare with the number from part (a)? 


19. The Fresnel integrals

$$c(x) = \int_0^x \cos\left(\frac{\pi}{2}t^2\right)dt \qquad s(x) = \int_0^x \sin\left(\frac{\pi}{2}t^2\right)dt$$

arise in the study of light diffraction at a rectangular aperture.

(a) Construct a table of values for c(x) and s(x) for x ranging from 0 through
2 in increments of 0.2. Each entry in the table should be accurate to five
decimal places.

(b) Determine the two smallest positive values for x such that c(x) = 0.5,
accurate to four decimal places. Repeat for the equation s(x) = 0.5.
20. Consider the integral


(a) Use the adaptive two-point Gaussian quadrature scheme to tabulate the 
value of this integral for x ranging from 0 through 10 in increments of 0.5. 
Each tabulated value should be accurate to six decimal places. 


(b) What happens if you try to use the adaptive Simpson’s rule to tabulate the 
values of this integral? Can you think of a way to alleviate this problem? 


21. Evaluate 


(et loans oe 
f e2(1-2)3/3 gp 


and 


Lf Qe a 2 3 
ie eo? +22°/3 a 


ti e722 +223 /3 dz 


These expressions arise in determining the probability that an allele with a se- 
lective advantage over its competitors will become fixed in a population (see P. 
D. Taylor and A. Sauer, “The Selective Advantage of Sex-Ratio Homeostasis,” 
American Naturalist, 116, 305-310, 1980). 


22. Rework the “Flow between Parallel Plates” problem assuming that the lower 
plate is maintained at 100°C and the upper plate is maintained at 20°C. 


6.9 IMPROPER INTEGRALS AND OTHER DISCONTINUITIES


The definite integral

$$\int_a^b f(x)\,dx$$

is called improper if

• one or both of the integration limits is infinite; or
• the integrand, f, is discontinuous somewhere on the integration interval [a, b].

Provided the integral exists, the quadrature rules that have been developed in
this chapter can be used to approximate the value, though some change of variable
generally must be made to achieve theoretical order of convergence. The objective of
this section is to demonstrate the substitutions appropriate for a variety of different
situations.
         Trapezoidal      Simpson’s        Midpoint         Two-Point
  n      Rule             Rule             Rule             Gaussian
  2      3.502 × 10^-2    1.336 × 10^-2    1.117 × 10^-2    1.338 × 10^-3
  4      1.193 × 10^-2    4.229 × 10^-3    3.963 × 10^-3    4.225 × 10^-4
  8      3.982 × 10^-3    1.334 × 10^-3    1.361 × 10^-3    1.331 × 10^-4
 16      1.311 × 10^-3    4.202 × 10^-4    4.568 × 10^-4    4.194 × 10^-5
 32      4.269 × 10^-4    1.324 × 10^-4    1.509 × 10^-4    1.321 × 10^-5
 64      1.380 × 10^-4    4.169 × 10^-5    4.930 × 10^-5    4.161 × 10^-6

Order of Convergence:
 Theoretical    2         4         2         4
 Experimental   1.5986    1.6651    1.5667    1.6659


TABLE 6.5: Error in Approximate Value of $\int_0^1 x^{2/3}\,dx$ Computed Using Four Basic Quadrature Rules


Discontinuous Derivatives 


Consider the definite integral 


$$\int_0^1 x^{2/3}\,dx.$$


Technically, this integral does not fit into this section because it is not improper.
However, the development of efficient procedures for evaluating certain improper
integrals will depend upon the information we obtain from studying this problem.
Table 6.5 lists the error in the approximate value of the above integral
computed using four basic quadrature rules: the trapezoidal rule, Simpson’s rule,
the midpoint rule, and the two-point Gaussian quadrature rule. Each method
works in the sense that as n is increased, the approximation error converges to
zero; however, none of the methods performs at its theoretical order of convergence.

What caused this reduction in the order of convergence? Note that though 
the integrand in this problem is defined throughout the integration interval, all 
of its derivatives are unbounded at x = 0. Every quadrature rule that has been
developed in this chapter has an error term which contains some order derivative 
of the integrand. For example, the error term for the trapezoidal rule contains the 
second derivative of the integrand, while the error term for the two-point Gaussian 
quadrature rule contains the fourth derivative. The theoretical order of convergence 
of each method is therefore predicated on sufficient smoothness in the integrand. 
As this example demonstrates, when the integrand is not sufficiently differentiable, 
the methods may still work, but if they do, they may perform below theoretical 
levels of efficiency. 

The derivatives of $x^{2/3}$ become unbounded at x = 0 because of the presence
of a fractional exponent. We can easily clear the fractional exponent by making the
change of variables $x = u^3$. With this substitution, the original integral becomes

$$\int_0^1 3u^4\,du.$$

The new integrand, $3u^4$, is now infinitely differentiable. The data in Table 6.6
confirm that when the trapezoidal rule, Simpson’s rule, the midpoint rule,
and the two-point Gaussian quadrature rule are applied to this transformed integral,
each method performs at its theoretical order of convergence.

         Trapezoidal      Simpson’s        Midpoint         Two-Point
  n      Rule             Rule             Rule             Gaussian
  2      2.438 × 10^-1    2.500 × 10^-2    1.195 × 10^-1    1.042 × 10^-3
  4      6.211 × 10^-2    1.563 × 10^-3    3.091 × 10^-2    6.510 × 10^-5
  8      1.560 × 10^-2    9.766 × 10^-5    7.791 × 10^-3    4.069 × 10^-6
 16      3.905 × 10^-3    6.104 × 10^-6    1.952 × 10^-3    2.543 × 10^-7
 32      9.765 × 10^-4    3.815 × 10^-7    4.882 × 10^-4    1.589 × 10^-8
 64      2.441 × 10^-4    2.384 × 10^-8    1.221 × 10^-4    9.934 × 10^-10

Order of Convergence:
 Theoretical    2         4         2         4
 Experimental   1.9939    4.0000    1.9894    4.0000


TABLE 6.6: Error in Approximate Value of $\int_0^1 3u^4\,du$ Computed Using Four Basic Quadrature Rules
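The loss and recovery of convergence order reported in Tables 6.5 and 6.6 can be checked with a few lines of code. The following is a minimal sketch for the trapezoidal rule only; the function names `trapezoid` and `observed_order` are illustrative choices, not routines from the text, and the order is estimated simply as the base-2 logarithm of the ratio of errors at successive doublings of n.

```python
import math

def trapezoid(f, a, b, n):
    """Composite trapezoidal rule with n equal subintervals."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for i in range(1, n):
        total += f(a + i * h)
    return h * total

def observed_order(f, exact, ns=(16, 32, 64)):
    """Estimate the convergence order from errors at successive doublings of n."""
    errs = [abs(trapezoid(f, 0.0, 1.0, n) - exact) for n in ns]
    return [math.log2(errs[i] / errs[i + 1]) for i in range(len(errs) - 1)]

original = lambda x: x ** (2.0 / 3.0)   # derivatives unbounded at x = 0
transformed = lambda u: 3.0 * u ** 4    # after the substitution x = u^3

# Both integrals have the exact value 3/5.
print("x^(2/3):", observed_order(original, 0.6))
print("3u^4   :", observed_order(transformed, 0.6))
```

The first list should come out well below the theoretical order of 2, while the second should sit very close to 2.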


Removable Discontinuities 


Recall that when a function f has a discontinuity at x = a, but $\lim_{x \to a} f(x)$ exists,
that discontinuity is said to be removable. The proper handling of a removable
discontinuity depends on whether or not the derivatives of f are bounded as x
approaches a. This issue can be determined by expanding f in an infinite series
about x = a.

Consider the function $f(x) = (e^x - 1)/x$, which is discontinuous at x = 0.
Expanding the exponential function into its MacLaurin series and simplifying, we
find

$$f(x) = \frac{e^x - 1}{x} = \frac{\left(1 + x + \frac{1}{2}x^2 + \frac{1}{6}x^3 + \cdots\right) - 1}{x} = 1 + \frac{1}{2}x + \frac{1}{6}x^2 + \cdots.$$

Two observations can be made from this expansion. First, $\lim_{x \to 0} f(x) = 1$, so
the discontinuity is removable. Second, since the expansion contains only positive
integer powers of x, all derivatives of f are bounded as x approaches 0.

What do these observations imply regarding the evaluation of the integral

$$\int_{-1}^{1} \frac{e^x - 1}{x}\,dx?$$
Since the derivatives of the integrand are bounded throughout the integration
interval, there is no need to make a change of variables, and we can expect that
any method will perform at its theoretical order of convergence. All we have to
do is avoid evaluating $(e^x - 1)/x$ at x = 0. This can be accomplished by carefully
selecting the method to guarantee that x = 0 is not used as an abscissa. A better
approach is to program the integrand as the piecewise function

$$f(x) = \begin{cases} \dfrac{e^x - 1}{x}, & x \ne 0 \\ 1, & x = 0, \end{cases}$$

explicitly taking into account the limit as x approaches 0, and then we can use any
method. Following the latter approach and using the adaptive three-point Gaussian
quadrature rule with an error tolerance of $5 \times 10^{-11}$, the value of the integral is
found to be 2.1145017507.
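Programming the integrand as a piecewise function might look like the sketch below. It is illustrative only: a fixed composite three-point Gauss-Legendre rule stands in for the adaptive routine used in the text, the helper name `composite_gauss` and the panel count are arbitrary choices, and `math.expm1` is used for $e^x - 1$ to limit cancellation near x = 0.

```python
import math
import numpy as np

def f(x):
    """(e^x - 1)/x with the removable discontinuity at x = 0 filled in by its limit."""
    if x == 0.0:
        return 1.0
    return math.expm1(x) / x

def composite_gauss(func, a, b, panels, points=3):
    """Composite Gauss-Legendre rule: `points` nodes on each of `panels` equal panels."""
    nodes, weights = np.polynomial.legendre.leggauss(points)
    width = (b - a) / panels
    total = 0.0
    for k in range(panels):
        mid = a + (k + 0.5) * width
        for t, w in zip(nodes, weights):
            total += w * func(mid + 0.5 * width * t)
    return 0.5 * width * total

value = composite_gauss(f, -1.0, 1.0, 64)
print(f"{value:.10f}")
```

Because the branch at x = 0 returns the limiting value, the same function can be handed to any rule, including closed rules that would place an abscissa at the origin.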

Next consider the integral

$$\int_0^1 \frac{\sin x}{\sqrt{x}}\,dx.$$

The series expansion of the integrand is

$$\frac{\sin x}{\sqrt{x}} = x^{1/2} - \frac{x^{5/2}}{6} + \frac{x^{9/2}}{120} - \cdots.$$

Hence, $\lim_{x \to 0} \sin x/\sqrt{x} = 0$, and the discontinuity at x = 0 is removable. This
expansion also establishes that the derivatives of the integrand are discontinuous
at x = 0. Since the powers of x are all multiples of 1/2, the appropriate change of
variables is $x = u^2$. With this substitution, the original integral becomes

$$\int_0^1 2\sin(u^2)\,du.$$


The adaptive three-point Gaussian quadrature rule, with a tolerance of $5 \times 10^{-11}$,
produces the results


Approximate value of integral: 0.620536603446 
Error estimate: 1.0248e-11 
Number of function evaluations: 87 


For comparison, had the change of variable not been carried out, the same quadra- 
ture rule with the same tolerance parameter would have yielded 


Approximate value of integral: 0.620536603453 
Error estimate: 8.8688e-12 
Number of function evaluations: 677 


Thus, nearly eight times as many function evaluations are needed when the discon- 
tinuous nature of the derivatives is not taken into account. 
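The effect of the substitution on the work done by an adaptive routine can be observed directly. The sketch below uses a textbook-style recursive adaptive Simpson's rule rather than the adaptive three-point Gaussian rule quoted above, so the evaluation counts it reports will differ from 87 and 677; the routine, its name, and the tolerance of 1e-8 are assumptions made for illustration.

```python
import math

def adaptive_simpson(f, a, b, tol):
    """Recursive adaptive Simpson's rule; returns (approximation, function evaluations)."""
    count = 0

    def g(x):
        nonlocal count
        count += 1
        return f(x)

    def recurse(a, fa, b, fb, m, fm, whole, tol):
        lm, rm = 0.5 * (a + m), 0.5 * (m + b)
        flm, frm = g(lm), g(rm)
        left = (m - a) / 6.0 * (fa + 4.0 * flm + fm)
        right = (b - m) / 6.0 * (fm + 4.0 * frm + fb)
        diff = left + right - whole
        if abs(diff) <= 15.0 * tol:
            return left + right + diff / 15.0
        return (recurse(a, fa, m, fm, lm, flm, left, tol / 2.0)
                + recurse(m, fm, b, fb, rm, frm, right, tol / 2.0))

    fa, fb = g(a), g(b)
    m = 0.5 * (a + b)
    fm = g(m)
    whole = (b - a) / 6.0 * (fa + 4.0 * fm + fb)
    return recurse(a, fa, b, fb, m, fm, whole, tol), count

def original(x):
    """sin(x)/sqrt(x), with the limiting value 0 supplied at x = 0."""
    return 0.0 if x == 0.0 else math.sin(x) / math.sqrt(x)

def transformed(u):
    """Integrand after the substitution x = u^2."""
    return 2.0 * math.sin(u * u)

for name, func in (("original", original), ("transformed", transformed)):
    value, evals = adaptive_simpson(func, 0.0, 1.0, 1e-8)
    print(f"{name:12s} value = {value:.10f}  evaluations = {evals}")
```

Both calls should agree on the value of the integral, with the untransformed integrand forcing the routine to refine repeatedly near x = 0.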
Nonremovable Algebraic Discontinuities


Nonremovable discontinuities (those for which $\lim_{x \to a} f(x)$ does not exist) can be
subdivided into two categories: algebraic and logarithmic. Algebraic discontinuities
are those for which the integrand behaves like $(x - a)^{-\alpha}$ for some $\alpha > 0$ as x
approaches a. On the other hand, with a logarithmic discontinuity, the integrand
tends to infinity like $\ln(x - a)$ as x approaches a. We will start by examining the
proper way to deal with an algebraic discontinuity.

When working with a nonremovable algebraic discontinuity, some change of 
variables will have to be made. To determine the appropriate substitution, a series 
expansion of the integrand is an extremely valuable tool. As an example, take the 


definite integral 
if (cos( Ya) - 1)ev* 
o |= ar2(14+ 82) 
The integrand is clearly undefined at the lower limit of integration. Expanding the
cosine, the exponential and the reciprocal of 1+ #z into their respective MacLaurin 
series and simplifying yields 


2) ~ 1) ev= és ak/5\ f  gk/2\ [ 


k=0 


woo ae eye a L ephis sari en 


2 2 24 


There are three observations to make from this expansion. First, since the leading 
power in the expansion is larger than —1, the discontinuity at x = 0 is integrable. 


Second, 
iim (cos(#/2) — 1) ev* ze 
e0t = n2(14+ Ye) 
so the discontinuity is not removable. Finally, the least common denominator of
the rational exponents in the expansion is 30. This indicates that the appropriate
change of variable for this problem is $x = u^{30}$. In terms of the variable u, the


integral reads . 
1. Sgteonu’ =e" . d 
[0 ae 


The value of this integral is found to be −0.7043797072 using the adaptive Boole’s
rule with an error tolerance of $5 \times 10^{-11}$.


—0O, 


Logarithmic Discontinuities 


Whereas algebraic discontinuities are typically handled by making a change of vari- 
able, logarithmic discontinuities are typically treated by subtracting out the leading 
term in the behavior near the discontinuity. Consider the integral 


$$\int_0^1 \frac{\ln x}{1 + x^2}\,dx.$$
As x approaches 0, the value of the natural logarithm tends toward negative infinity,
and thus the integrand has a logarithmic discontinuity. Expanding $(1 + x^2)^{-1}$ in
series, we find

$$\frac{\ln x}{1 + x^2} = (1 - x^2 + x^4 - x^6 + \cdots)\ln x = \ln x - x^2\ln x + x^4\ln x - x^6\ln x + \cdots,$$

so the discontinuous behavior of the integrand is controlled by ln x. Note that each
of the terms following the first is well behaved since $\lim_{x \to 0^+} x^{\alpha}\ln x = 0$ for all
$\alpha > 0$. Subtracting away the discontinuous behavior of the integrand, we rewrite
the original problem as

$$\int_0^1 \frac{\ln x}{1 + x^2}\,dx = \int_0^1 \left[\frac{\ln x}{1 + x^2} - \ln x\right]dx + \int_0^1 \ln x\,dx = -\int_0^1 \frac{x^2\ln x}{1 + x^2}\,dx + \int_0^1 \ln x\,dx.$$


The second integral can be evaluated analytically:

$$\int_0^1 \ln x\,dx = (x\ln x - x)\Big|_0^1 = -1.$$


In the first integral, the logarithmic discontinuity has been replaced by a removable
discontinuity. Programming the integrand as the piecewise function

$$f(x) = \begin{cases} \dfrac{x^2\ln x}{1 + x^2}, & x \ne 0 \\ 0, & x = 0 \end{cases}$$

and using the adaptive Boole’s rule, the value of the first integral is found to be

$$-\int_0^1 \frac{x^2\ln x}{1 + x^2}\,dx \approx 0.0840344058.$$


Therefore,

$$\int_0^1 \frac{\ln x}{1 + x^2}\,dx \approx 0.0840344058 + (-1) = -0.9159655942.$$
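The subtraction technique can be sketched in code as follows. This is an illustration under stated assumptions, not the adaptive Boole's rule used in the text: a fixed composite five-point Gauss-Legendre rule is applied to the regular part, and the names `regular_part` and `composite_gauss` and the panel count are hypothetical choices.

```python
import math
import numpy as np

def regular_part(x):
    """x^2 ln(x) / (1 + x^2), with its limiting value 0 supplied at x = 0."""
    if x == 0.0:
        return 0.0
    return x * x * math.log(x) / (1.0 + x * x)

def composite_gauss(func, a, b, panels, points=5):
    """Composite Gauss-Legendre rule: `points` nodes on each of `panels` equal panels."""
    nodes, weights = np.polynomial.legendre.leggauss(points)
    width = (b - a) / panels
    total = 0.0
    for k in range(panels):
        mid = a + (k + 0.5) * width
        for t, w in zip(nodes, weights):
            total += w * func(mid + 0.5 * width * t)
    return 0.5 * width * total

# First integral: -int_0^1 x^2 ln(x)/(1 + x^2) dx, evaluated numerically.
first = -composite_gauss(regular_part, 0.0, 1.0, 400)
# Second integral: int_0^1 ln(x) dx = -1, evaluated analytically.
second = -1.0
total = first + second
print(f"first = {first:.10f}, total = {total:.10f}")
```

Only the well-behaved remainder is ever passed to the quadrature rule; the logarithm itself is handled entirely by the closed-form value −1.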
EXAMPLE 6.21 Another Logarithmic Discontinuity


As a second problem with a logarithmic discontinuity, consider the definite integral 


$$\int_0^1 \ln(x - \sin x)\,dx.$$

As in the previous problem, this integral has a logarithmic discontinuity at x = 0.
To determine the leading behavior in the discontinuity, we expand the sine function
into its MacLaurin series and simplify:

$$\ln(x - \sin x) = \ln\left[x - \left(x - \tfrac{1}{6}x^3 + \tfrac{1}{120}x^5 - \cdots\right)\right] = \ln x^3 + \ln\left(\tfrac{1}{6} - \tfrac{1}{120}x^2 + \cdots\right).$$


Thus, we want to subtract $\ln x^3$ from the integrand. Balancing this with the addition
of $\ln x^3$, followed by the application of the properties of logarithms, we rewrite
the original integral as

$$\int_0^1 \ln(x - \sin x)\,dx = \int_0^1 \left[\ln(x - \sin x) - \ln x^3\right]dx + \int_0^1 \ln x^3\,dx = \int_0^1 \ln\left(\frac{x - \sin x}{x^3}\right)dx + 3\int_0^1 \ln x\,dx.$$


The discontinuity in the first integral on the right-hand side is now removable and
is handled by programming the integrand as the piecewise function

$$f(x) = \begin{cases} \ln\left(\dfrac{x - \sin x}{x^3}\right), & x \ne 0 \\ \ln\left(\tfrac{1}{6}\right), & x = 0. \end{cases}$$

The adaptive Boole’s rule provides the estimate

$$\int_0^1 \ln\left(\frac{x - \sin x}{x^3}\right)dx \approx -1.8084378485;$$

therefore,

$$\int_0^1 \ln(x - \sin x)\,dx \approx -1.8084378485 + 3(-1) = -4.8084378485.$$
Infinite Limits of Integration


A variety of different changes of variable can be used to transform infinite
integration intervals into finite intervals. Here, we will focus on the two substitutions
x = 1/u and x = tan θ. For a review of other substitutions used to treat infinite
limits of integration, see Ueberhuber [1].

As a first example with an infinite limit of integration, consider

$$\int_1^{\infty} \frac{e^{-x}}{x}\,dx.$$

We will make the substitution x = 1/u. This particular change of variable can be
applied to transform any infinite integration interval into a finite length interval,
provided zero is not contained in the original interval. For this problem, x = 1/u
transforms the interval [1, ∞) into [0, 1]. In terms of the new variable, u, the
integration problem becomes

$$\int_1^{\infty} \frac{e^{-x}}{x}\,dx = \int_1^0 \frac{e^{-1/u}}{1/u}\left(-\frac{1}{u^2}\right)du = \int_0^1 \frac{e^{-1/u}}{u}\,du.$$

Note that we have traded an infinite limit of integration for a discontinuity at u = 0.
Since

$$\lim_{u \to 0^+} \frac{e^{-1/u}}{u} = 0,$$

the discontinuity at u = 0 is removable. Proceeding as we have done with other
removable discontinuities (programming the integrand as a piecewise function), we
find the approximate value of the original integral to be 0.2193839344.
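A minimal sketch of this computation is given below, assuming a fixed composite five-point Gauss-Legendre rule in place of the adaptive routine; the names `g` and `composite_gauss` and the panel count are illustrative choices. The integrand is programmed in the variable u, with the limiting value 0 supplied at u = 0.

```python
import math
import numpy as np

def g(u):
    """Transformed integrand e^{-1/u}/u, with the limiting value 0 supplied at u = 0."""
    if u == 0.0:
        return 0.0
    return math.exp(-1.0 / u) / u

def composite_gauss(func, a, b, panels, points=5):
    """Composite Gauss-Legendre rule: `points` nodes on each of `panels` equal panels."""
    nodes, weights = np.polynomial.legendre.leggauss(points)
    width = (b - a) / panels
    total = 0.0
    for k in range(panels):
        mid = a + (k + 0.5) * width
        for t, w in zip(nodes, weights):
            total += w * func(mid + 0.5 * width * t)
    return 0.5 * width * total

# int_1^infinity e^{-x}/x dx, computed as int_0^1 e^{-1/u}/u du
value = composite_gauss(g, 0.0, 1.0, 100)
print(f"{value:.10f}")
```

For very small u the exponential underflows harmlessly to zero, so no special handling beyond the u = 0 branch is needed.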


For the integral

$$\int_0^{\infty} t^2 e^{-t^2}\,dt,$$

the substitution t = 1/u will not have the desired effect since zero is contained in
the original integration interval: after the change of variable, we would still have
an infinite limit of integration. To circumvent this problem, we simply split the
integration interval into two pieces, [0, a] and [a, ∞). Any nonzero value for a will
work. For this problem, let’s choose a = 1 and break the integral into

$$\int_0^{\infty} t^2 e^{-t^2}\,dt = \int_0^1 t^2 e^{-t^2}\,dt + \int_1^{\infty} t^2 e^{-t^2}\,dt.$$

The first integral on the right-hand side is not improper and can be approximated
directly. We find

$$\int_0^1 t^2 e^{-t^2}\,dt \approx 0.189472345819$$

using the adaptive Boole’s rule with an error tolerance of $2.5 \times 10^{-11}$. In the second
integral, we make the substitution t = 1/u, producing

$$\int_1^{\infty} t^2 e^{-t^2}\,dt = \int_1^0 u^{-2} e^{-1/u^2}\left(-\frac{du}{u^2}\right) = \int_0^1 \frac{e^{-1/u^2}}{u^4}\,du.$$
The discontinuity at u = 0 is removable, with $\lim_{u \to 0^+} u^{-4} e^{-1/u^2} = 0$. Using the
adaptive Boole’s rule with the same error tolerance as specified above, we find

$$\int_0^1 \frac{e^{-1/u^2}}{u^4}\,du \approx 0.253641116905.$$

Therefore,

$$\int_0^{\infty} t^2 e^{-t^2}\,dt \approx 0.189472345819 + 0.253641116905 = 0.443113462724.$$

The exact value for this integral is $\sqrt{\pi}/4$, so the error in the approximate value is
roughly $2 \times 10^{-12}$.
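The split-and-substitute strategy can be sketched as follows. The code is illustrative only: it uses a fixed composite five-point Gauss-Legendre rule instead of the adaptive Boole's rule, and the helper names and panel counts are assumptions. The known exact value $\sqrt{\pi}/4$ serves as a check.

```python
import math
import numpy as np

def composite_gauss(func, a, b, panels, points=5):
    """Composite Gauss-Legendre rule: `points` nodes on each of `panels` equal panels."""
    nodes, weights = np.polynomial.legendre.leggauss(points)
    width = (b - a) / panels
    total = 0.0
    for k in range(panels):
        mid = a + (k + 0.5) * width
        for t, w in zip(nodes, weights):
            total += w * func(mid + 0.5 * width * t)
    return 0.5 * width * total

def head(t):
    """Integrand on [0, 1], used as given."""
    return t * t * math.exp(-t * t)

def tail(u):
    """Integrand on [1, infinity) after t = 1/u; limiting value 0 at u = 0."""
    if u == 0.0:
        return 0.0
    return math.exp(-1.0 / (u * u)) / u ** 4

piece1 = composite_gauss(head, 0.0, 1.0, 200)
piece2 = composite_gauss(tail, 0.0, 1.0, 200)
total = piece1 + piece2
print(f"{piece1:.12f} + {piece2:.12f} = {total:.12f}")
print(f"error vs sqrt(pi)/4: {abs(total - math.sqrt(math.pi) / 4.0):.2e}")
```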
As a final general example, consider

$$\int_{-\infty}^{\infty} \frac{e^{-x^2}}{1 + x + x^2}\,dx.$$

We could split the integration interval into 3 pieces, say (−∞, −1] ∪ [−1, 1] ∪ [1, ∞),
and then apply the substitution x = 1/u to the first and last subintervals. This
approach is a bit cumbersome, though, so we choose instead to let x = tan θ, which
allows the interval −∞ < x < ∞ to be treated as a single piece. With x = tan θ,
the integral is transformed to

$$\int_{-\infty}^{\infty} \frac{e^{-x^2}}{1 + x + x^2}\,dx = \int_{-\pi/2}^{\pi/2} \frac{e^{-\tan^2\theta}}{\sec^2\theta + \tan\theta}\,\sec^2\theta\,d\theta = \int_{-\pi/2}^{\pi/2} \frac{e^{-\tan^2\theta}}{1 + \sin\theta\cos\theta}\,d\theta.$$

The integral in θ has two removable discontinuities, one at each endpoint.
Approximating the value of this integral, we find

$$\int_{-\infty}^{\infty} \frac{e^{-x^2}}{1 + x + x^2}\,dx \approx 1.5327082101.$$
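The tangent substitution can be sketched in code as follows, assuming the integrand $e^{-x^2}/(1 + x + x^2)$ and, once again, a fixed composite five-point Gauss-Legendre rule in place of an adaptive routine. The function names and the panel count are illustrative assumptions.

```python
import math
import numpy as np

def integrand(theta):
    """e^{-tan^2(theta)} / (1 + sin(theta) cos(theta)); limiting value 0 at theta = +/- pi/2."""
    if abs(theta) >= math.pi / 2.0:
        return 0.0
    t = math.tan(theta)
    return math.exp(-t * t) / (1.0 + math.sin(theta) * math.cos(theta))

def composite_gauss(func, a, b, panels, points=5):
    """Composite Gauss-Legendre rule: `points` nodes on each of `panels` equal panels."""
    nodes, weights = np.polynomial.legendre.leggauss(points)
    width = (b - a) / panels
    total = 0.0
    for k in range(panels):
        mid = a + (k + 0.5) * width
        for t, w in zip(nodes, weights):
            total += w * func(mid + 0.5 * width * t)
    return 0.5 * width * total

value = composite_gauss(integrand, -math.pi / 2.0, math.pi / 2.0, 200)
print(f"{value:.10f}")
```

A single finite interval replaces the three-piece split, at the cost of supplying the limiting value of the integrand at the two endpoints.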


Miscellaneous Examples 


EXAMPLE 6.22 An Integral from Electron Gas Theory

The integral

$$\int_0^{\infty} e^{-x^2}\tan^{-1}(1/x)\,dx$$

arises in calculating the exchange-correlation energy of an electron gas in a strong
magnetic field (M. L. Glasser, “The Electron Gas in a Magnetic Field: Nonrelativistic
Ground State Properties,” in Theoretical Chemistry: Advances and Perspectives,
Volume 2, H. Eyring and D. Henderson, eds., Academic Press, New York, 1976, pp.
67-129). To isolate the handling of the discontinuity at the lower limit of integration
from the handling of the infinite upper limit, we break the problem into two
parts as follows:


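$$\int_0^{\infty} e^{-x^2}\tan^{-1}(1/x)\,dx = \int_0^1 e^{-x^2}\tan^{-1}(1/x)\,dx + \int_1^{\infty} e^{-x^2}\tan^{-1}(1/x)\,dx.$$

The integration interval could have been split at any value of x; x = 1 is just
convenient. On [0, 1], the discontinuity at x = 0 is removable since

$$\lim_{x \to 0^+} e^{-x^2}\tan^{-1}(1/x) = \frac{\pi}{2}.$$

Taking this limit into account, we find

$$\int_0^1 e^{-x^2}\tan^{-1}(1/x)\,dx \approx 0.8901979553.$$

On [1, ∞), we introduce the change of variable x = 1/u, which yields

$$\int_1^{\infty} e^{-x^2}\tan^{-1}(1/x)\,dx = \int_0^1 \frac{e^{-1/u^2}\tan^{-1}u}{u^2}\,du.$$

With

$$\lim_{u \to 0^+} \frac{e^{-1/u^2}\tan^{-1}u}{u^2} = 0,$$

we then find

$$\int_0^1 \frac{e^{-1/u^2}\tan^{-1}u}{u^2}\,du \approx 0.0921039610.$$

Recombining the two pieces, we have

$$\int_0^{\infty} e^{-x^2}\tan^{-1}(1/x)\,dx \approx 0.9823019163.$$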


EXAMPLE 6.23 The Period of Shock Wave Oscillations 


In obtaining an asymptotic estimate for the period of shock wave oscillations pro- 
duced during lithotripsy [see L. Howle, D. Schaeffer, M. Shearer, and P. Zhong, 
“Lithotripsy: The Treatment of Kidney Stones with Shock Waves,” SIAM Review, 
40 (2), pp. 356-371, 1998], the integral 


377 
af j- ypu 
must be evaluated. The integrand is discontinuous at y = 1, and derivatives beyond
the first order are discontinuous at y = 0. To handle these discontinuities separately,
we will split the integration interval at y = 1/2. On [0, 1/2], we let $y = x^2$ and find


1/2 ys V2/2 ; rT 
dy = 2 —~ dz ~ 0.072851 : 
i. i pw . ats] ae da = 0.0728516434 


Next, on [1/2, 1], we let $1 - y = x^2$ and find


ia car (= 25 
he ay= [2 J ast a gt dP © 0.6739825568. 
Vel 
a, 


Therefore, the value of the integral over [0, 1] is approximately 0.9146813565.


EXAMPLE 6.24 Light Transmission Through a Crystal


Suppose we need to evaluate

$$\int_0^{\infty} \frac{\cosh\left[(1 + t^2)^{-1/2}\right]}{1 + t^2}\,dt,$$

which is a particular member of a family of integrals that arise in the study of light
transmission through a crystal (M. L. Glasser, “Two Definite Integrals Arising in
Light Transmission Through a Crystal: Problem 93-4,” SIAM Review, 35 (1), p.
136, 1993). We could proceed with this problem as we have done on previous
problems of the same type, splitting the integration interval into [0, 1] ∪ [1, ∞) and
using the change of variable t = 1/u on the latter subinterval. Given the structure of
this integrand, however, it will be more efficient to let t = tan θ. This trigonometric
substitution transforms the integral into

$$\int_0^{\pi/2} \cosh(\cos\theta)\,d\theta,$$

which has no discontinuities. Evaluating this integral, we find

$$\int_0^{\infty} \frac{\cosh\left[(1 + t^2)^{-1/2}\right]}{1 + t^2}\,dt \approx 1.9887316303.$$
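Because the transformed integrand $\cosh(\cos\theta)$ is smooth on $[0, \pi/2]$, even a basic composite rule handles it easily. The sketch below uses a composite Simpson's rule with 64 subintervals; the routine name `simpson` and the value of n are illustrative choices rather than the method used in the text.

```python
import math

def simpson(f, a, b, n):
    """Composite Simpson's rule with n (even) equal subintervals."""
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4.0 if i % 2 else 2.0) * f(a + i * h)
    return h * total / 3.0

# int_0^infinity cosh[(1+t^2)^(-1/2)]/(1+t^2) dt after the substitution t = tan(theta)
value = simpson(lambda th: math.cosh(math.cos(th)), 0.0, math.pi / 2.0, 64)
print(f"{value:.10f}")
```

No piecewise definition or interval splitting is needed here, which is the efficiency gain the trigonometric substitution provides for this integrand.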
EXERCISES

In Exercises 1-3,

(a) Compute the value of the indicated definite integral using the trapezoidal rule, 
Simpson’s rule, the midpoint rule, and the two-point Gaussian quadrature rule, 
programming the integrand as given. Use n = 2, 4, 8, 16, 32, and 64 for each 
method. Compare the observed order of convergence with the theoretical value. 


(b) Repeat part (a) after making an appropriate change of variable in the integrand. 


1. 
2. 
3. 


de eV" dx 
a al? dep 
So sin(./z) dx 


For the integrals given in Exercises 4-13, identify each discontinuity/limit of integration 
which must be handled, then take appropriate action, and compute the value of the 
integral, accurate to at least ten decimal places. 


4, 


15. 


(2 88 as 


eae dx 
0 l+z 


| faze as 


i en” dr 


L = 
le wae 
eo Tees) Fe 
A he 22x dz 


QO 1+ 


| fy [Pee — 4) ae 
‘ {* en* Ore) de 


fore] 2 
5 fs (a7FT rsh) dg 
14. Compute the value of the integral


oO 
Ing 
—, a 
‘| tee 


accurate to at least ten decimal places in two ways: 


(a) making the substitution x = 1/u; and
(b) making the substitution x = tan θ.


15. An integral of the form

$$\int_{-1}^{1} \frac{f(x)}{\sqrt{1 - x^2}}\,dx$$

has discontinuities at both endpoints of the integration interval. For integrals of
this type, the substitution x = sin θ transforms the problem to

$$\int_{-1}^{1} \frac{f(x)}{\sqrt{1 - x^2}}\,dx = \int_{-\pi/2}^{\pi/2} f(\sin\theta)\,d\theta.$$


Evaluate each of the following integrals using this approach. 


1 Ed 
(a) hie Fat dz 
1 4 
(b) Jy seers oe 
1 ( 
(c) ft, BEB ae 
16. Repeat Exercise 15, but make the substitution x = cos θ.

17. The integral


is —t/a —2?/2 
G(t) -f ee dz 
0 


arises in studies of hopping transport for one-dimensional percolation (see J. 
Bernasconi, “Hopping Transport in One-Dimensional Percolation Model: A 
Comment,” Phys. Rev. B, 25, 1982, pp. 1394-1395). Evaluate G(1) and G(5). 


18. In determining the overlap interaction for the kinetic energy of a free electron 
gas, the integral 


K(a) = ie [(e-* + e7)* — (6% + e**)] dex 


arises (see W. Harrison, “Total Energies in the Tight-Binding Theory,” Phys. 
Rev. B, 23, 1981, pp. 5230-5245). In particular, the value of K(5/3) is needed. 
Evaluate K (5/3). 


19. Evaluate the integrals

$$\int_0^{\infty} \frac{x^2}{e^x - 1}\,dx \quad\text{and}\quad \int_0^{\infty} \frac{x^3}{e^x - 1}\,dx,$$

which arise in determining the photon density and the energy density, respectively,
associated with blackbody radiation (see A. Beiser, Concepts of Modern
Physics, McGraw-Hill, New York, 1981).


CHAPTER 7 


Initial Value Problems of 
Ordinary Differential Equations 


AN OVERVIEW
Some Background


An equation in which an unknown function appears inside one or more derivatives 
is called a differential equation. When the unknown is a function of only one 
independent variable, the equation is referred to as an ordinary differential equation. 
On the other hand, when the unknown is a function of more than one independent 
variable, we call the equation a partial differential equation. Numerical techniques 
for approximating the solution of a partial differential equation will be treated in 
Chapters 9, 10, and 11. 

In principle, solving a differential equation requires integration. From our 
knowledge of calculus, we know that integration introduces arbitrary constants; 
therefore, the solution to a differential equation will contain one or more arbitrary 
constants. Values can be determined for these constants by imposing extra con- 
ditions on the solution. If these extra conditions are specified at more than one 
value of the independent variable, they are called boundary conditions. A differ- 
ential equation combined with a set of boundary conditions is called a boundary 
value problem. Chapter 8 discusses techniques for approximating the solution of a 
boundary value problem. When all of the extra conditions that have been imposed 
on the solution are specified at a single value of the independent variable, they are 
called initial conditions, and the combination of a differential equation and a set of 
initial conditions is called an initial value problem.


Fundamental Mathematical Problem 


In this chapter, we will develop and investigate the performance of numerical tech- 
niques for approximating the solution of initial value problems of ordinary differ- 
ential equations. The basic problem we will address can be stated as follows: Find 
the function y(t) that satisfies 


$$y'(t) = f(t, y(t)), \quad a \le t \le b$$
$$y(a) = \alpha,$$


where the right-hand side function, f, and the initial value, α, are given. In the
case of a single equation (also known as the scalar case), y and f will be real-valued
functions and α a real number. For a system of equations, y and f will be vector-valued
functions [i.e., for each t, y(t), f(t, y(t)) ∈ R^n for some integer n] and α a
vector. We will focus our attention at the beginning of the chapter on scalar initial
value problems and then discuss how to generalize methods for scalar problems
in order to solve systems of equations. Higher-order differential equations will be
handled by recasting the equation as a system of first-order equations.

The “Projection Printing” problem capsule that was presented in the overview 
to Chapter 1 (see page 6), is one application that gives rise to an initial value
problem of ordinary differential equations. Here are two more examples. 


Population Growth in a Closed System 


Consider the growth of a species in a closed system. Let p(t) denote the population
of the species at any time t, and suppose the initial population p(0) = p_0 is known.
A standard assumption in population modeling is that the instantaneous rate of
change in the population is proportional to the population; that is,

$$\frac{dp}{dt} = kp$$

for some proportionality factor k. Note that k represents the overall growth rate
of the population. To determine a specific functional form for k we make three
additional assumptions.


• The population exhibits a constant, intrinsic per capita growth rate;

• Due to limited resources (such as food, water and shelter), a “crowding”
effect causes a reduction in the overall growth rate which is proportional to
the population; and

• The accumulation of waste products in the closed system produces a “toxicity”
effect which further reduces the overall growth rate by an amount proportional
to $\int_0^t p(\tau)\,d\tau$.

Bringing all of these effects together, we arrive at Volterra’s model for population
growth

$$\frac{dp}{dt} = \left[r - \lambda p - c\int_0^t p(\tau)\,d\tau\right]p,$$

where r is the intrinsic growth rate, λ is the crowding coefficient, and c is the
toxicity coefficient.

As formulated, Volterra’s model is an integro-differential equation. Note that
the unknown appears inside both an integral and a derivative. Following the
procedure of TeBeest [1], we can reduce this to a second-order differential equation.
First, introduce the dimensionless variables

$$u = \frac{p}{r/\lambda} \quad\text{and}\quad \hat{t} = \frac{t}{\lambda/c}.$$

This produces the nondimensional problem (with the hat on the time variable dropped)

$$\kappa\frac{du}{dt} = u - u^2 - u\int_0^t u(\tau)\,d\tau, \qquad u(0) = u_0 = \frac{p_0}{r/\lambda},$$

where κ = c/(rλ). Now, change the dependent variable to y = ln u and differentiate
the resulting equation with respect to t. This yields

$$\frac{d^2y}{dt^2} = -\frac{e^y}{\kappa}\left(\frac{dy}{dt} + 1\right). \qquad (1)$$

The initial conditions associated with (1) are

$$y(0) = \ln u(0) = \ln u_0 \qquad (2)$$

and

$$y'(0) = \frac{u'(0)}{u(0)} = \frac{1 - u_0}{\kappa}. \qquad (3)$$

Equations (1), (2), and (3) form an initial value problem for determining the time
evolution of the population.
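The reduction to equation (1) can be written out step by step. The following worked sketch assumes the nondimensional form $\kappa u' = u - u^2 - u\int_0^t u\,d\tau$ with $\kappa = c/(r\lambda)$, divides by u, substitutes $y = \ln u$ (so that $u = e^y$ and $u'/u = y'$), and then differentiates once to eliminate the integral:

```latex
\begin{align*}
\kappa \frac{du}{dt} &= u - u^2 - u\int_0^t u(\tau)\,d\tau \\
\kappa \frac{1}{u}\frac{du}{dt} &= 1 - u - \int_0^t u(\tau)\,d\tau \\
\kappa \frac{dy}{dt} &= 1 - e^{y} - \int_0^t e^{y(\tau)}\,d\tau \\
\kappa \frac{d^2y}{dt^2} &= -e^{y}\frac{dy}{dt} - e^{y}
  = -e^{y}\left(\frac{dy}{dt} + 1\right)
\end{align*}
```

Setting t = 0 in the third line, where the integral vanishes, gives $\kappa y'(0) = 1 - u_0$, which is the initial condition (3).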


A Catalyzed Reaction 


Suppose two chemical compounds, A and B, react to form a product, P. When 
this reaction takes place in isolation, it proceeds slowly; however, when this
reaction takes place in the presence of chemical compound C, the rate of reaction is
increased. Since its presence causes an increase in reaction rate, the chemical C is
called a catalyst. It is believed that the catalyzed reaction proceeds according to 
the following two-step mechanism: 


A + C → X
X + B → P + C.


The compound X is called a catalyst-reactant complex. 

To model the evolution of this reaction, suppose that each step occurs at a 
rate proportional to the product of the concentrations of the reactants. Taking into 
account the rates at which each compound is produced and consumed, we arrive at 
the following system of five equations with associated initial conditions: 


$$\begin{aligned}
[A]' &= -k_1 [A][C], & [A](0) &= A_0 \\
[B]' &= -k_2 [B][X], & [B](0) &= B_0 \\
[C]' &= -k_1 [A][C] + k_2 [B][X], & [C](0) &= C_0 \\
[X]' &= k_1 [A][C] - k_2 [B][X], & [X](0) &= 0 \\
[P]' &= k_2 [B][X], & [P](0) &= 0
\end{aligned} \qquad (4)$$

Here, [·] denotes the concentration of the indicated compound, primes indicate
differentiation with respect to time, k_1 and k_2 are the rate constants for the two
steps of the reaction, and A_0, B_0, and C_0 are the initial concentrations of A, B,
and C, respectively.

Given values for the parameters k_1, k_2, A_0, B_0, and C_0, a numerical solution
to the initial value problem in (4) can be obtained. With a little extra analysis,
however, the size of the problem can be reduced to just two equations. First, add
the second and fifth differential equations in (4) to obtain

$$([B] + [P])' = 0 \quad\Longrightarrow\quad [B] + [P] = \text{constant}.$$

Evaluating this last expression at t = 0 determines the value of the constant to
be B_0. Solving for [P], we then find

$$[P] = B_0 - [B]. \qquad (5)$$

Proceeding in a similar manner with the third and fourth differential equations
in (4), we find

$$[C] = C_0 - [X]. \qquad (6)$$

A third relationship, [A] − [B] + [X] = A_0 − B_0, is obtained after adding the fourth
differential equation to the first and then subtracting the second. To simplify
matters, from this point forward we will assume that A_0 = B_0. With this assumption,
it follows that

$$[X] = [B] - [A]. \qquad (7)$$

Now, substitute (6) and (7) into the first two equations in (4) to obtain

$$\begin{aligned}
[A]' &= -k_1 [A]\,(C_0 - [B] + [A]), & [A](0) &= A_0 \\
[B]' &= -k_2 [B]\,([B] - [A]), & [B](0) &= A_0.
\end{aligned} \qquad (8)$$


Once the system (8) has been solved for [A] and [B], equations (5), (6), and (7) 
can be used to calculate [P], [C], and [X]. 
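The three conservation relations behind (5), (6), and (7) can be checked numerically by evaluating the right-hand sides of system (4) at an arbitrary state and confirming that the relevant combinations of derivatives vanish. The sketch below is illustrative; the function name `rates`, the sample concentrations, and the rate constants are hypothetical values chosen only for the check.

```python
def rates(state, k1, k2):
    """Right-hand sides of system (4); state = ([A], [B], [C], [X], [P])."""
    A, B, C, X, P = state
    step1 = k1 * A * C   # rate of the step A + C -> X
    step2 = k2 * B * X   # rate of the step X + B -> P + C
    return (-step1, -step2, -step1 + step2, step1 - step2, step2)

# Arbitrary (hypothetical) concentrations and rate constants.
dA, dB, dC, dX, dP = rates((0.7, 0.9, 0.4, 0.2, 0.1), k1=2.0, k2=3.0)

print("([B] + [P])'       =", dB + dP)        # leads to relation (5)
print("([C] + [X])'       =", dC + dX)        # leads to relation (6)
print("([A] - [B] + [X])' =", dA - dB + dX)   # leads to relation (7) when A_0 = B_0
```

Each printed combination should be zero (up to rounding) for any state and any positive rate constants, which is what allows three of the five unknowns to be eliminated.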
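As a final preliminary step, let’s introduce the dimensionless variables

$$\alpha = \frac{[A]}{A_0}, \quad \beta = \frac{[B]}{A_0}, \quad\text{and}\quad \tau = k_1 A_0 t$$

and define the dimensionless parameters

$$\kappa = \frac{k_2}{k_1} \quad\text{and}\quad \lambda = \frac{C_0}{A_0}.$$

In terms of these quantities, the system (8) becomes

$$\begin{aligned}
\alpha' &= -\alpha(\lambda - \beta + \alpha), & \alpha(0) &= 1 \\
\beta' &= -\kappa\beta(\beta - \alpha), & \beta(0) &= 1,
\end{aligned}$$

where primes now denote differentiation with respect to τ. We would like to study
the dynamics of this last system for various values of λ and κ.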
Remainder of the Chapter


Section 1 provides basic information related to the theory of initial value problems 
for ordinary differential equations and introduces some of the key numerical con- 
cepts that will be addressed throughout the chapter. Euler’s method, the simplest 
numerical technique for approximating the solution of an initial value problem, is 
then presented in Section 2. More advanced techniques, including Taylor methods, 
Runge-Kutta methods, Adams-Bashforth methods, and Adams-Moulton methods, 
are developed in the next three sections. Convergence analysis for these techniques 
is presented in Section 6. Section 7 explores the issue of error control and the 
construction of variable step size algorithms, and Section 8 discusses the solution 
of higher-order differential equations and systems of differential equations. The 
chapter concludes with a discussion of absolute stability and stiff equations. 
7.1 KEY NUMERICAL CONCEPTS AND ELEMENTS OF CONTINUOUS THEORY


For the time being, we will restrict our attention to the scalar first-order initial
value problem

$$y'(t) = f(t, y(t)), \quad a \le t \le b$$
$$y(a) = \alpha. \qquad (1)$$


Although the true solution of (1), y(t), will be a continuous function defined for all 
a<t<_b, the approximate solution we obtain will consist of individual approxi- 
mations to the values of y at only a discrete set of times, say 


a=to <ty <te <-+-+<tny-1 <tn =). 


Furthermore, we will adopt the following notation: let y; denote the value of the 
true solution at t = t; and let w; denote the approximation to y;. In other words, 
throughout our discussions, we will have 


wy = y(t). 


The techniques developed in this chapter are all based on the replacement
of the differential equation in (1) by a difference equation. Once this replacement
has been made, the approximate solution is computed via a “time marching,” or
“time stepping,” procedure. Starting from $w_0 = \alpha$, the initial value of the true
solution, the difference equation is used to determine $w_1$; this essentially marches
the solution from $t = t_0$ forward to $t = t_1$. Next, $w_2$ is determined from $w_1$ (and
possibly also $w_0$), $w_3$ is determined from $w_2$ (and possibly also $w_1$ and $w_0$), and
so on. The procedure terminates when $w_N$ has been computed, that is, when time has
been marched out to $t_N = b$.




One-Step versus Multistep Methods 


We will consider two different types of initial value problem solvers in this text:
one-step methods and multistep methods. The general form for a one-step method
is

\[ \frac{w_{i+1} - w_i}{h_i} = \phi(f, t_i, w_i, w_{i+1}, h_i), \tag{2} \]

where $h_i = t_{i+1} - t_i$. The rationale behind the nomenclature of a “one-step method”
is clear from the form of the difference equation in (2): the computation of $w_{i+1}$
requires knowledge of $w_i$ only, where $w_i$ is the approximation made one step prior
to the present value.

If the function $\phi$ is independent of $w_{i+1}$, then the difference equation can
be solved explicitly for $w_{i+1}$, so the method is said to be explicit. For an explicit
method, knowing the right-hand side function $f$ and given values for $t_i$, $h_i$, and $w_i$,
we can immediately calculate the value of $w_{i+1}$. When $\phi$ does depend on $w_{i+1}$, the
difference equation defines the value of $w_{i+1}$ only implicitly, so the method is said
to be implicit. For an implicit method, some sort of rootfinding technique, such as
fixed-point iteration or Newton’s method, must be applied during each time step
to determine the value of $w_{i+1}$.


EXAMPLE 7.1 Explicit versus Implicit 


Consider the difference equation

\[ \frac{w_{i+1} - w_i}{h_i} = f(t_i, w_i). \]

Here, $\phi(f, t_i, w_i, w_{i+1}, h_i) = f(t_i, w_i)$, so this equation fits the pattern of an
explicit one-step method. In particular, this difference equation is known as Euler’s
method, which we will examine in depth in Section 7.2. As a second example, consider
the difference equation

\[ \frac{w_{i+1} - w_i}{h_i} = \frac{1}{2}\left[ f(t_i, w_i) + f(t_i + h_i, w_{i+1}) \right], \]

which has

\[ \phi(f, t_i, w_i, w_{i+1}, h_i) = \frac{1}{2}\left[ f(t_i, w_i) + f(t_i + h_i, w_{i+1}) \right]. \]

This equation, which is known as the trapezoidal method, therefore fits the pattern
of an implicit one-step method.

To illustrate more clearly the practical difference between an explicit method
and an implicit method, suppose we were to apply Euler’s method and the
trapezoidal method to the differential equation

\[ y' = t \sin y. \]

Comparing this specific equation to our model problem, we identify $f(t, y) = t \sin y$.
The difference equation for Euler’s method then becomes

\[ \frac{w_{i+1} - w_i}{h_i} = t_i \sin w_i, \]

from which we obtain the explicit expression $w_{i+1} = w_i + h_i t_i \sin w_i$. To calculate
$w_{i+1}$, we only need to have values for $w_i$, $h_i$, and $t_i$ and then perform some basic
arithmetic operations. On the other hand, substituting $f(t, y) = t \sin y$ into the
difference equation for the trapezoidal method yields

\[ \frac{w_{i+1} - w_i}{h_i} = \frac{1}{2}\left[ t_i \sin w_i + (t_i + h_i) \sin w_{i+1} \right]. \]

There is no way to explicitly solve this equation for $w_{i+1}$. Hence, even given values
for $w_i$, $h_i$, and $t_i$, we will have to solve a nonlinear equation to determine $w_{i+1}$.
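
The contrast between the two updates can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the function names, the tolerance, the use of the Euler value as the starting guess for Newton's method, and the sample values of $t_i$, $w_i$, and $h_i$ are all assumptions made for the sketch.

```python
import math

def f(t, y):
    # Right-hand side of the example equation y' = t sin y.
    return t * math.sin(y)

def euler_step(t, w, h):
    # Explicit update: w_next follows directly from known quantities.
    return w + h * f(t, w)

def trapezoidal_step(t, w, h, tol=1e-12, max_iter=50):
    # Implicit update: solve
    #   g(z) = z - w - (h/2) [f(t, w) + f(t + h, z)] = 0
    # for z = w_next with Newton's method, starting from the Euler value.
    z = euler_step(t, w, h)
    for _ in range(max_iter):
        g = z - w - 0.5 * h * (f(t, w) + f(t + h, z))
        dg = 1.0 - 0.5 * h * (t + h) * math.cos(z)  # derivative of g with respect to z
        dz = g / dg
        z -= dz
        if abs(dz) < tol:
            return z
    raise RuntimeError("Newton iteration did not converge")

# Illustrative values (assumed, not taken from the text).
t_i, w_i, h_i = 1.0, 1.0, 0.1
w_explicit = euler_step(t_i, w_i, h_i)
w_implicit = trapezoidal_step(t_i, w_i, h_i)
```

The explicit step costs one evaluation of $f$; the implicit step costs several evaluations per time step, one set for each Newton iteration.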


If one-step methods are thought of as using information regarding where we
currently are to predict where we should be at the next time step, then multistep
methods use information regarding where we are and where we have been to make
that prediction. For example, a two-step method would use both $w_i$ and $w_{i-1}$ to
compute $w_{i+1}$, while a four-step method would use $w_i$, $w_{i-1}$, $w_{i-2}$, and $w_{i-3}$. The
general linear $m$-step method can be written in the form

\[
\frac{w_{i+1} - a_1 w_i - a_2 w_{i-1} - \cdots - a_m w_{i+1-m}}{h_i} =
b_0 f(t_{i+1}, w_{i+1}) + b_1 f(t_i, w_i) + b_2 f(t_{i-1}, w_{i-1}) + \cdots + b_m f(t_{i+1-m}, w_{i+1-m}). \tag{3}
\]

The method is explicit when $b_0 = 0$, and implicit otherwise.


EXAMPLE 7.2 A Couple of Multistep Methods 
Consider the difference equation

\[
\frac{w_{i+1} - w_i}{h_i} = \frac{23}{12} f(t_i, w_i) - \frac{4}{3} f(t_{i-1}, w_{i-1}) + \frac{5}{12} f(t_{i-2}, w_{i-2}).
\]

Since the smallest subscript that appears in this equation is $i - 2$, which is three
less than $i + 1$, this is a three-step method. Furthermore, because $w_{i+1}$ does not
appear inside the function $f$, this is an explicit method. Matching the difference
equation to equation (3), we see that the values of the various coefficients are

\[
a_1 = 1, \quad a_2 = a_3 = 0, \quad b_0 = 0, \quad b_1 = \frac{23}{12}, \quad b_2 = -\frac{4}{3}, \quad b_3 = \frac{5}{12}.
\]

As a second example, consider the difference equation

\[
\frac{w_{i+1} - \frac{4}{3} w_i + \frac{1}{3} w_{i-1}}{h_i} = \frac{2}{3} f(t_{i+1}, w_{i+1}).
\]

This is an implicit two-step method because $w_{i+1}$ does appear inside the function $f$
and the smallest subscript in the equation is $i - 1$. Matching the difference equation
to equation (3), we see that the values of the various coefficients are

\[
a_1 = \frac{4}{3}, \quad a_2 = -\frac{1}{3}, \quad b_0 = \frac{2}{3}, \quad b_1 = b_2 = 0.
\]
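
Once the coefficients have been read off, an explicit linear multistep method reduces to a weighted sum of stored values. The following Python sketch shows one step of the three-step method above, written against form (3) with $b_0 = 0$. The function name, the newest-first ordering of the history lists, and the test problem $y' = 3t^2$ are assumptions chosen for illustration.

```python
def explicit_multistep_step(f, t_hist, w_hist, h, a, b):
    """Advance one step of an explicit linear m-step method in form (3).

    t_hist, w_hist: the m most recent points, newest first,
                    i.e. (t_i, w_i), (t_{i-1}, w_{i-1}), ...
    a, b: coefficient lists [a_1, ..., a_m] and [b_1, ..., b_m]; b_0 = 0 is assumed.
    """
    m = len(a)
    linear = sum(a[j] * w_hist[j] for j in range(m))
    slopes = sum(b[j] * f(t_hist[j], w_hist[j]) for j in range(m))
    return linear + h * slopes

# Coefficients of the three-step example.
a = [1.0, 0.0, 0.0]
b = [23.0 / 12.0, -4.0 / 3.0, 5.0 / 12.0]

# Hypothetical test problem: y' = 3 t^2, y(0) = 0, whose solution is y = t^3.
f = lambda t, y: 3.0 * t * t
h = 0.1
t_hist = [0.2, 0.1, 0.0]
w_hist = [0.2**3, 0.1**3, 0.0]

w_next = explicit_multistep_step(f, t_hist, w_hist, h, a, b)  # approximates y(0.3)
```

For this particular right-hand side the step returns $0.3^3 = 0.027$ up to roundoff, which makes it a convenient check on the coefficients.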


Fundamental Numerical Concepts 


The analysis of numerical methods for initial value problems, be they one-step 
or multistep methods, involves two different types of errors. One measures how 
well the difference equation that defines the method approximates the differential 
equation; the other measures how well the solution of the difference equation ap- 
proximates the solution of the differential equation. The sensitivity of the solution 
of the difference equation to changes in the initial condition is also considered. 
The error associated with the approximation of the differential equation is
called the local truncation error, which we shall denote by the symbol $\tau_i$. We refer
to this error as “local” because it measures the error generated by one step of the
method, assuming the solution at previous steps was exact. As such, truncation
error is determined by substituting the solution of the differential equation into the
method’s difference equation. Recalling equations (2) and (3), it follows that for a
one-step method

\[ \tau_i = \frac{y_{i+1} - y_i}{h_i} - \phi(f, t_i, y_i, y_{i+1}, h_i) \]

and for a linear multistep method

\[
\tau_i = \frac{y_{i+1} - \sum_{j=1}^{m} a_j y_{i+1-j}}{h_i} - \sum_{j=0}^{m} b_j f(t_{i+1-j}, y_{i+1-j}).
\]

If $\tau_i \to 0$ as $h_i \to 0$, then the difference equation is said to be consistent with
the differential equation $y' = f(t, y)$ and the numerical method is called consistent.
Additionally, if $\tau_i = O(h_i^p)$ for some constant $p$, the method is said to be of order $p$.

Of course, the error in which we are most interested is the difference between
the solution of the differential equation and the solution of the difference equation,
$y_i - w_i$. This is known as the global discretization error and measures the cumulative
effect of the errors introduced by all of the time steps taken. One might expect
that global error would be a monotonically increasing function of the number of
steps taken, but this is not necessarily the case. We will encounter examples in
subsequent sections for which global error oscillates, as well as examples for which
global error monotonically decreases.

Our true interest, however, is not in the behavior of global error for a fixed
partition of the interval $a \le t \le b$. Rather, our concern is with the behavior
of global error as the partition is refined. In particular, we want to increase the
number of time steps, $N$, in such a way that $h = \max_i (t_{i+1} - t_i)$ tends to zero, but
fix $t_N = b$ regardless of the value of $N$. If

\[ \lim_{h \to 0} \max_{1 \le i \le N} |y_i - w_i| = 0, \]

we say the method is convergent.

Another desirable property for a numerical method to have is stability. For
a given numerical method and a fixed partition of $a \le t \le b$, let $\{w_i\}$ denote the
sequence of values generated from an initial condition of $\alpha$ and $\{\tilde{w}_i\}$ denote the
sequence generated from an initial condition of $\tilde{\alpha}$. The numerical method is stable
if and only if there exists a function $k(t) > 0$ such that

\[ |\tilde{w}_i - w_i| \le k(t_i)\,|\tilde{\alpha} - \alpha| \]

for all $i$. The key point is that the function $k(t_i)$ must be independent of $h$. The
method is unstable when no such bound exists.

We’ve just introduced three important properties for numerical methods for 
initial value problems—consistency, convergence, and stability. The most important 
of these properties is convergence. If the solution of the difference equation does not 
tend toward the solution of the differential equation as the computational partition 
is refined, the method is useless. 


Elements of Continuous Theory 


Before we start developing and investigating specific numerical methods for initial 
value problems, it is worthwhile taking a few moments to examine certain aspects 
of the theory of initial value problems. We will focus on three areas: existence, 
uniqueness, and stability. Existence deals with the question of whether a given 
initial value problem has a solution. Uniqueness takes this issue one step further. 
If an initial value problem has a solution, does it have only one? The concept of 
stability for an initial value problem is similar to that of stability of a numerical 
method and deals with sensitivity of the solution to changes in initial conditions. 
We will make this statement more precise below. 

We are interested in the theory of initial value problems for two reasons. On 
the one hand, the conditions under which an initial value problem will have a unique 
solution and be stable will place the conditions under which a numerical method is 
consistent, convergent, and stable into perspective. On the other hand, the theory 
of initial value problems has certain practical implications. For example, suppose 
that a particular initial value problem does not have a solution. In this case, it will 
not matter how sophisticated a numerical method we have at our disposal. Any 
calculated values will be meaningless. The same is true if an initial value problem 
does not have a unique solution. 

What practical implications are associated with sensitivity to changes in initial
data? Due to the presence of roundoff error, when we implement our numerical
methods, we will never actually be solving the original initial value problem

\[ y' = f(t, y), \quad y(a) = \alpha. \]

Instead, we will be solving the perturbed problem

\[ z' = f(t, z) + \delta(t), \quad z(a) = \alpha + \epsilon_0. \]

Our hope is that the solutions of these problems are “close”; that is, that $|z(t) - y(t)|$
is bounded on some interval $a \le t \le b$. More precisely, we are counting on the
existence of positive constants $\epsilon$ and $k$ such that, whenever

\[ |\epsilon_0| < \epsilon \quad \text{and} \quad |\delta(t)| < \epsilon \quad \text{for all } t \in [a, b], \]

the perturbed problem has a unique solution that satisfies

\[ |z(t) - y(t)| < k\epsilon \]

for all $t \in [a, b]$. If these conditions hold, then the original initial value problem is
said to be stable. Thus, only when the original problem is stable can we expect the
numerical solution to provide a reasonable approximation to the solution of that
original problem.

An initial value problem that has a unique solution and is stable is said to be 
well-posed. An important tool for establishing that an initial value problem is well 
posed is the Lipschitz condition. 


Definition. A function $f(t, y)$ satisfies a LIPSCHITZ CONDITION in $y$ on the
set $D \subset \mathbb{R}^2$ if there exists a constant $L > 0$ such that

\[ |f(t, y_1) - f(t, y_2)| \le L\,|y_1 - y_2| \]

for all $(t, y_1), (t, y_2) \in D$. The constant $L$ is called the LIPSCHITZ CONSTANT
for $f$.


Note the central role that the set $D$ plays in this definition. A given function
may satisfy a Lipschitz condition on one set, but not on another. Furthermore, the
value of $L$ may depend on $D$.


EXAMPLE 7.3 Determining Lipschitz Conditions 
Let $f(t, y) = y \sin t$ and take $D = \mathbb{R}^2$. Since

\[ f(t, y_1) - f(t, y_2) = y_1 \sin t - y_2 \sin t = (y_1 - y_2) \sin t \]

for all $(t, y_1), (t, y_2) \in D$, it follows that

\[ |f(t, y_1) - f(t, y_2)| = |\sin t|\,|y_1 - y_2| \le |y_1 - y_2| \]

for all $(t, y_1), (t, y_2) \in D$. Hence, $f$ satisfies a Lipschitz condition in $y$ on all of
$\mathbb{R}^2$, with Lipschitz constant $L = 1$.

As a second example, consider the function

\[ f(t, y) = \frac{t^2 y^2}{1 + t^2}. \]

Let $M > 0$, and take $D = \{(t, y) \mid t \in \mathbb{R},\ -M < y < M\}$. Here, we have

\[
f(t, y_1) - f(t, y_2) = \frac{t^2 y_1^2}{1 + t^2} - \frac{t^2 y_2^2}{1 + t^2}
= \frac{t^2}{1 + t^2}\,(y_1^2 - y_2^2) = \frac{t^2}{1 + t^2}\,(y_1 + y_2)(y_1 - y_2),
\]

from which it follows that

\[ |f(t, y_1) - f(t, y_2)| = \left| \frac{t^2}{1 + t^2} \right| |y_1 + y_2|\,|y_1 - y_2|. \]

Now suppose $(t, y_1)$ and $(t, y_2)$ are both in $D$. For any $t \in \mathbb{R}$, $0 \le t^2/(1 + t^2) < 1$.
Further, since $-M < y_1, y_2 < M$, we have $|y_1 + y_2| \le |y_1| + |y_2| < 2M$. Therefore,
for all $(t, y_1), (t, y_2) \in D$,

\[ |f(t, y_1) - f(t, y_2)| \le 2M\,|y_1 - y_2|, \]

and $f$ satisfies a Lipschitz condition in $y$ on $D$ with Lipschitz constant $L = 2M$.
Note the dependence of the Lipschitz constant on the set $D$. Also note that,
although $f$ satisfies a Lipschitz condition on $D$ for any fixed $M$, we cannot extend
this result to all of $\mathbb{R}^2$. If we try to take $M \to \infty$, we will no longer be able to
obtain an upper bound for the quantity $|y_1 + y_2|$.


It can be difficult to establish directly from the definition that a function
satisfies a Lipschitz condition. The following result can be useful in certain instances.


Theorem. If $f$ is defined on $D = \{(t, y) \mid a \le t \le b,\ c \le y \le d\}$ and there
exists a constant $L > 0$ such that

\[ \left| \frac{\partial f}{\partial y}(t, y) \right| \le L \]

for all $(t, y) \in D$, then $f$ satisfies a Lipschitz condition in $y$ on the set $D$ with
Lipschitz constant $L$.


Proof. This is a generalization of the Mean Value Theorem. Fix $t \in [a, b]$,
and let $(t, y_1), (t, y_2) \in D$. Apply the Mean Value Theorem to $f$, with $t$ fixed,
to obtain

\[ f(t, y_1) - f(t, y_2) = \frac{\partial f}{\partial y}(t, \xi)\,(y_1 - y_2) \]

for some $\xi$ between $y_1$ and $y_2$. The result follows upon taking the absolute
value of this last expression and applying the stated bound on $|\partial f/\partial y|$. $\square$

EXAMPLE 7.4 Another Lipschitz Condition
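Consider the set $D = \{(t, y) \mid 0 \le t \le 1,\ 0 \le y \le \pi/4\}$, and let

\[ f(t, y) = \frac{t \sec y}{t - 2}. \]

Here, we have

\[ \frac{\partial f}{\partial y} = \frac{t}{t - 2} \sec y \tan y. \]

For $0 \le t \le 1$, $|t/(t - 2)| \le 1$, and for $0 \le y \le \pi/4$, $|\sec y \tan y| \le \sqrt{2}$. Therefore,

\[ \left| \frac{\partial f}{\partial y} \right| \le \sqrt{2} \quad \text{for all } (t, y) \in D, \]

so $f$ satisfies a Lipschitz condition in $y$ on $D$ with $L = \sqrt{2}$.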
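
A bound of this kind is easy to sanity-check numerically before relying on it. The Python sketch below samples $|\partial f/\partial y|$ on a uniform grid over $D$ and reports the largest value found. The grid resolution and the names `dfdy` and `largest` are assumptions of the sketch, and a grid search only suggests a bound; it does not prove one.

```python
import math

def dfdy(t, y):
    # Partial derivative of f(t, y) = t sec(y) / (t - 2) with respect to y.
    return t / (t - 2.0) * (1.0 / math.cos(y)) * math.tan(y)

n = 200  # grid resolution in each direction (an arbitrary choice)
largest = max(
    abs(dfdy(i / n, (math.pi / 4) * j / n))
    for i in range(n + 1)
    for j in range(n + 1)
)
# The maximum over the grid occurs at the corner t = 1, y = pi/4,
# where the exact value is sqrt(2).
```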


The following theorem links the Lipschitz condition with the concept of well 
posedness. More general statements can be made, but this will be sufficient for our 
purposes. For a proof of this theorem, see Coddington [1]. 


Theorem. Let $D = \{(t, y) \mid a \le t \le T,\ |y - \alpha| \le \delta\}$ for some $\delta > 0$. Suppose
that the function $f(t, y)$ is continuous on $D$ and satisfies a Lipschitz condition
in $y$ on $D$. Then, there exists an interval $[a, b] \subset [a, T]$ such that the initial
value problem

\[ y' = f(t, y), \quad y(a) = \alpha \]

is well posed on $a \le t \le b$.

EXERCISES


1. Identify each of the following difference equations as representing a one-step 
method or a multistep method and as being implicit or explicit. 


itl — wi _ 3 1 
(a) oe = 5S (tims) — af (ti-1, wi-t) 
Renee hy h 
(b) ae Wi = (+ fm + * tan) 
Wity — Wi _ 2 le wisn) + a Wi) — A Fi 1, Wi-1) 
(c) —_— >, = 19 itl, Wit) 3 4, Mt 12 ie a as 


(d) aa ae = f(t, wi41) 


a1 — 4u; + 3. 
(e) bh SETH = 2 F(te1, wins) 
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. Each of the following difference equations represents a linear multistep method. 
Identify the number of steps, m, and the values of the coefficients a; and 6;. 


Wi41 — Wie 1 
(a) —— aa [f (ti+1, wig1) + 4f (ti, wi) + f(K-1, wi-1)} 
itl — Wie 4 
(b) ar ad 5 [2f (bs, we) — f(le-1, wi-1) + 2f (Hi-2, wi_a)}] 
Wy. — Wi- 
(c) Sa = 2f (tz, wi) 
Witt ~ Wie-2 
(d) 7 = 
7 [f (spa, Witr) + 3f (te, wi) + 3F (t-1, wi-1) + f(ti~2, wi-2)] 
(e) ae Wi _ : 
ogi tien, Wit) + ag J (tis wi) - ag i (ti-1, Wi-1) + gq f (ti-2, wi-2) 
Witl — B io } i 1 3 
(f) eee = Ff (titi, wir) ~ 3G f(te, wi) + 7 F(t-1, wir) 


3. Show that each of the following functions satisfies a Lipschitz condition in $y$ on
the indicated set $D$.


(a) f(y) =ty, D={(ty|-1<ts1,0<y< 10} 

(b) f(hy) =tVo/(l+t), D={(tylteRy2V 

(c) f(hu)=e/y, D={(ty)]0Sts2y2]} 

(4) f(y)=l-yte%y’, D={(ty)]0St<1,-5 Sy <5} 

(ce) f,y)=y?-t/y, D={(ty)|-2<t<2,1<y< 10} 

. Let M > 0. Show that the function f(t,y) — ¢<? ~ 4y/t satisfies a Lipschitz 
condition in y on any set D of the form { (¢,y)|t > M,y € R}. Does f satisfy a 
Lipschitz condition in y on the set D = {(t,y)|t > 0,y € R}? 


. Let M > 0. Show that the function f(t,y) = i(y? — y) satisfies a Lipschitz 
condition in y on any set D of the form {(£,y)|-l1<¢t<1,-M <y< M}. Does 
f satisfy a Lipschitz condition in y on the set D = {(t,y)| -l1<t< lye R}? 


6. Let $M > -1$. Show that the function $f(t, y) = 2ty/(y + 1)$ satisfies a Lipschitz
condition in $y$ on any set $D$ of the form $\{(t, y) \mid 0 \le t \le 1,\ y \ge M\}$. Does $f$
satisfy a Lipschitz condition in $y$ on the set $D = \{(t, y) \mid 0 \le t \le 1,\ y > -1\}$?

Consider the initial value problem $y' = f(t, y)$, $y(a) = \alpha$. If we integrate
both sides of the differential equation between $a$ and $t$, we obtain

\[ y(t) - y(a) = \int_a^t f(x, y(x))\,dx. \]

Using the initial condition and some algebra, this becomes

\[ y(t) = \alpha + \int_a^t f(x, y(x))\,dx. \]

Now, let $y_0(t) = \alpha$ and define the sequence of functions $\{y_k(t)\}$ by

\[ y_k(t) = \alpha + \int_a^t f(x, y_{k-1}(x))\,dx \]

for $k = 1, 2, 3, \ldots$. This technique is called Picard iteration. Under certain
conditions, this sequence can be shown to converge to the solution of the initial
value problem (see Coddington [1]).


In Exercises 7-10, perform Picard iteration to determine the indicated function in the
sequence $\{y_k(t)\}$.
Ty =U —dy?—1, (0) =0, yolt) 
8. y= 2e'-y, y(0)=1, ya(t) 
%y=t-y, y(0)=1, ys(t) 
10. y =i +y", y(0)=1, ya(t) 


Recall that the Taylor series for the function $y(t)$ about the point $t = a$ is given by

\[
y(a) + y'(a)(t - a) + \frac{y''(a)}{2!}(t - a)^2 + \frac{y'''(a)}{3!}(t - a)^3 + \cdots.
\]

If we are trying to solve the initial value problem $y' = f(t, y)$, $y(a) = \alpha$, notice that
the initial condition provides the value of $y(a)$ and, if we substitute $t = a$ into the
differential equation, we can calculate $y'(a)$. Differentiating the differential equation
with respect to $t$ and substituting $t = a$ will then give the value for $y''(a)$. Continuing
in this fashion, we can, in principle, obtain as many terms in the Taylor series for $y(t)$
as we desire.
In Exercises 11-14, use this approach to determine the first five terms in the
Taylor series expansion of the solution of the indicated initial value problem.

Wy =y-?, y(0)=1 

12, y=y-t, y(0)=2 

13. y =e'/y, y(0)=1 

14. y' =sint-y, y(0)=1 


7.2 EULER'S METHOD 
Consider the scalar, first-order initial value problem:

\[
\begin{aligned}
y'(t) &= f(t, y(t)), \quad a \le t \le b \\
y(a) &= \alpha.
\end{aligned}
\tag{1}
\]

Our objective is to determine a numerical approximation $w \approx y$, where $y(t)$ is the
exact solution of (1). As indicated in Section 7.1, we will determine values of $w$ at
the discrete set of points

\[ a = t_0 < t_1 < t_2 < \cdots < t_{N-1} < t_N = b \]

only, and we will adopt the notational convention that $w_i$ represents the
approximation to $y_i = y(t_i)$. For simplicity, the approximate solution will be sought at
equally spaced points; that is, for some positive integer $N$, we will define the step
size

\[ h = (b - a)/N, \]

and then the $t_i$ will be given by

\[ t_i = a + ih \quad (i = 0, 1, 2, \ldots, N). \]


Derivation of Method 


Euler’s method is the simplest of the one-step methods for approximating the
solution to the initial value problem (1). The derivation of the method begins by
assuming that the true solution of (1), $y(t)$, has two continuous derivatives.
Expanding this true solution in a Taylor series about the point $t = t_i$ produces

\[ y(t) = y_i + (t - t_i)\,y'_i + \frac{1}{2}(t - t_i)^2\,y''(\xi), \]

where $\xi$ is guaranteed to lie between $t$ and $t_i$. Evaluating the above Taylor expansion
at $t = t_{i+1}$ and substituting for $y'_i$ from the right-hand side of the differential
equation, we obtain

\[ y_{i+1} = y_i + h f(t_i, y_i) + \frac{h^2}{2}\,y''(\xi_i). \]

Euler’s method arises by dropping the error term and replacing $y_i$ (exact
solution) by $w_i$ (approximate solution):

\[
\begin{aligned}
w_0 &= \alpha \\
w_{i+1} &= w_i + h f(t_i, w_i), \quad i = 0, 1, 2, \ldots, N - 1.
\end{aligned}
\tag{2}
\]
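
The recursion (2) translates almost line for line into code. The following Python function is a minimal sketch; the name `euler`, its argument order, and the test problem $y' = -y$, $y(0) = 1$ are assumptions made for illustration.

```python
def euler(f, a, b, alpha, N):
    """Apply Euler's method (2) with N equal steps on [a, b].

    Returns the lists [t_0, ..., t_N] and [w_0, ..., w_N].
    """
    h = (b - a) / N
    t = [a + i * h for i in range(N + 1)]   # t_i = a + i h
    w = [alpha]                             # w_0 = alpha
    for i in range(N):
        w.append(w[i] + h * f(t[i], w[i]))  # w_{i+1} = w_i + h f(t_i, w_i)
    return t, w

# Hypothetical test problem: y' = -y, y(0) = 1, exact solution exp(-t).
t, w = euler(lambda t, y: -y, 0.0, 1.0, 1.0, 10)
```

For this linear problem each step multiplies the current value by $1 - h$, so with $h = 0.1$ the final value is $0.9^{10}$, compared with the exact $e^{-1}$.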


It is worth noting that Euler’s method can also be derived in other ways. For
example, if we were to first evaluate the differential equation in (1) at $t = t_i$ and
then replace the derivative term by the first-order forward difference approximation

\[ y'(t_i) = \frac{y_{i+1} - y_i}{h} - \frac{h}{2}\,y''(\xi), \]

we would obtain

\[ \frac{y_{i+1} - y_i}{h} - \frac{h}{2}\,y''(\xi) = f(t_i, y_i). \]

Upon dropping the error term, replacing $y_i$ by $w_i$, and solving for $w_{i+1}$, we would
reproduce (2). Yet another derivation begins by integrating both sides of the
differential equation in (1) from $t = t_i$ to $t = t_{i+1}$ to give

\[ y_{i+1} - y_i = \int_{t_i}^{t_{i+1}} f(t, y)\,dt. \]

Approximating the integral of $f(t, y)$ using a left-endpoint approximation again
reproduces (2).

EXAMPLE 7.5 Euler’s Method in Action, Problem 1


The initial value problem

\[ \frac{dx}{dt} = 1 + \frac{x}{t}, \quad 1 \le t \le 6, \qquad x(1) = 1 \]

has as its exact solution $x(t) = t(1 + \ln t)$. For this problem, $f(t, x)$ is given by
$f(t, x) = 1 + x/t$, so that the Euler’s method difference equation takes the form

\[
\begin{aligned}
w_0 &= 1 \\
w_{i+1} &= w_i + h\left(1 + \frac{w_i}{t_i}\right).
\end{aligned}
\]

Let’s use a step size of $h = 0.5$, which will require ten steps to advance from $t = 1$
to $t = 6$. With $t_0 = 1$ and $w_0 = 1$, we calculate

\[ w_1 = w_0 + h\left(1 + \frac{w_0}{t_0}\right) = 1 + 0.5\left(1 + \frac{1}{1}\right) = 2. \]

Advancing the value of the independent variable from $t_0$ to $t_1 = t_0 + h = 1.5$, we
then calculate

\[ w_2 = w_1 + h\left(1 + \frac{w_1}{t_1}\right) = 2 + 0.5\left(1 + \frac{2}{1.5}\right) = 3.16666667. \]

Continuing in this fashion, we obtain the following results, with the value of the
exact solution listed for comparison.

t_i    Approximate Solution    Exact Solution    |y(t_i) - w_i|

1.0         1.00000000           1.00000000
1.5         2.00000000           2.10819766        0.108198
2.0         3.16666667           3.38629436        0.219628
2.5         4.45833333           4.79072683        0.332393
3.0         5.85000000           6.29583687        0.445837
3.5         7.32500000           7.88467039        0.559670
4.0         8.87142857           9.54517744        0.673749
4.5        10.48035714          11.26834829        0.787991
5.0        12.14484127          13.04718956        0.902348
5.5        13.85932540          14.87611451        1.016789
6.0        15.61926407          16.75055682        1.131293


Observe the slow but steady growth in the global error as $t$ increases. Since
each step introduces new error into the computed approximate solution, we might
expect this type of behavior in every problem; however, the actual accumulation of
global error is very problem dependent. Essentially, the error introduced by each
step of the time marching process moves us from one solution of the differential
equation onto a different solution. If, as in the previous example, nearby solutions
separate from one another as $t$ increases, we can expect to see a steady increase in
the global error. On the other hand, if nearby solutions move closer together as $t$
increases, we could expect to observe a steady decline in the global error. The next
example demonstrates this situation.


EXAMPLE 7.6 Euler’s Method in Action, Problem 2 
The initial value problem

\[ \frac{dx}{dt} = \frac{t}{x}, \quad 0 \le t \le 5, \qquad x(0) = 1 \]

has as its exact solution $x(t) = \sqrt{t^2 + 1}$. Here the Euler’s method difference
equation takes the form

\[
\begin{aligned}
w_0 &= 1 \\
w_{i+1} &= w_i + h\,\frac{t_i}{w_i}.
\end{aligned}
\]

Let’s once again use a step size of $h = 0.5$. In the first time step we calculate

\[ w_1 = w_0 + h\,\frac{t_0}{w_0} = 1 + 0.5\,\frac{0}{1} = 1. \]

Advancing the value of the independent variable from $t_0$ to $t_1 = t_0 + h = 0.5$, we
then calculate

\[ w_2 = w_1 + h\,\frac{t_1}{w_1} = 1 + 0.5\,\frac{0.5}{1} = 1.25. \]

Continuing to time step in this fashion until we reach $t = 5$, we obtain the following
results, with the value of the exact solution listed for comparison. After an initial
increase in the global error, note the steady decline as $t$ advances.

t_i    Approximate Solution    Exact Solution    |y(t_i) - w_i|

0.0         1.00000000           1.00000000
0.5         1.00000000           1.11803399        0.118034
1.0         1.25000000           1.41421356        0.164214
1.5         1.65000000           1.80277564        0.152776
2.0         2.10454545           2.23606798        0.131523
2.5         2.57970744           2.69258240        0.112875
3.0         3.06425851           3.16227766        0.098019
3.5         3.55377339           3.64005494        0.086282
4.0         4.04620768           4.12310563        0.076898
4.5         4.54049768           4.60977223        0.069275
5.0         5.03603807           5.09901951        0.062981
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Let’s perform one more numerical experiment before examining the theoretical 
aspects of Euler’s method. We expect that the accuracy of the approximate solution 
generated by Euler’s method will improve if we decrease the step size $h$, but how
much improvement will we obtain? Is the global error $O(h)$? Is it $O(h^2)$? Is it
$O(\sqrt{h})$?

EXAMPLE 7.7 Numerical Investigation of the Rate of Convergence of
Euler's Method


Reconsider the initial value problem

\[ \frac{dx}{dt} = 1 + \frac{x}{t}, \quad 1 \le t \le 6, \qquad x(1) = 1, \]

whose exact solution is $x(t) = t(1 + \ln t)$. The following table displays the absolute
error in the approximate solution at $t = 6$, obtained from Euler’s method using
successively smaller step sizes. Note that each time the step size is cut by a factor of 2,
the absolute error shrinks by roughly the same factor. This suggests that the global
error associated with Euler’s method is $O(h)$.

h        Approximate Solution    Absolute Error    Error Ratio

1/2          15.61926407            1.131293
1/4          16.15574907            0.594808         1.90195
1/8          16.44564019            0.304917         1.95072
1/16         16.59620493            0.154352         1.97546
1/32         16.67290649            0.077650         1.98778
1/64         16.71161299            0.038944         1.99391
1/128        16.73105524            0.019502         1.99696
1/256        16.74079861            0.009758         1.99848

We arrive at the same conclusion when we reconsider the initial value problem

\[ \frac{dx}{dt} = \frac{t}{x}, \quad 0 \le t \le 5, \qquad x(0) = 1. \]

The table below displays the absolute error in the approximate solution at $t = 5$.
Once again, each time the step size is cut by a factor of 2, the absolute error shrinks
by roughly the same factor.

h        Approximate Solution    Absolute Error    Error Ratio

1/2           5.03603807            0.062981
1/4           5.06611825            0.032901         1.91426
1/8           5.08234203            0.016677         1.97280
1/16          5.09063747            0.008382         1.98967
1/32          5.09481923            0.004200         1.99559
1/64          5.09691725            0.002102         1.99798
1/128         5.09796787            0.001052         1.99903
1/256         5.09849357            0.000526         1.99953
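
An experiment of this kind takes only a few lines to set up. The Python sketch below recomputes the error at the right endpoint for step sizes $h = 1/2, 1/4, \ldots, 1/256$ and forms the ratio of successive errors. The helper names `euler_final` and `error_table` are assumptions of the sketch, and the last printed digits may differ slightly from the tables depending on floating-point details.

```python
import math

def euler_final(f, a, b, alpha, N):
    # Euler's method with N equal steps; returns only the value at t = b.
    h = (b - a) / N
    w = alpha
    for i in range(N):
        w += h * f(a + i * h, w)
    return w

def error_table(f, a, b, alpha, exact_b, levels=8):
    # Rows of (h, absolute error at t = b, ratio to the previous error).
    rows, prev = [], None
    for k in range(1, levels + 1):
        h = 1.0 / 2**k
        N = round((b - a) / h)
        err = abs(exact_b - euler_final(f, a, b, alpha, N))
        rows.append((h, err, None if prev is None else prev / err))
        prev = err
    return rows

# The two problems of this example, with their exact endpoint values.
rows1 = error_table(lambda t, x: 1 + x / t, 1.0, 6.0, 1.0, 6 * (1 + math.log(6)))
rows2 = error_table(lambda t, x: t / x, 0.0, 5.0, 1.0, math.sqrt(26))

for h, err, ratio in rows1:
    print(h, err, ratio)
```

An error ratio that settles near $2^p$ under step halving is the numerical signature of $O(h^p)$ convergence; here the ratios approach 2, consistent with $p = 1$.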


Analysis of Euler’s Method


We will begin by examining the local truncation error for Euler’s method. Recall
that truncation error measures how well the continuous differential equation has
been approximated by the discrete difference equation and is determined by
substituting the true solution into the difference equation. Given the first derivation
of Euler’s method, it is clear that

\[ \tau_i = \frac{y_{i+1} - y_i}{h} - f(t_i, y_i) = \frac{h}{2}\,y''(\xi_i). \]

Assuming that $y$ has two continuous derivatives, we have

\[ |\tau_i| \le \frac{h}{2} \max_{t \in [a, b]} |y''(t)|, \]

which implies that $\tau_i = O(h)$.

From this last expression we can make two important observations. First,
since $\tau_i = O(h)$ (i.e., the local truncation error goes like the first power of the step
size), it follows that Euler’s method is a first-order scheme for approximating the
solution of an initial value problem. Second, since $\tau_i \to 0$ as $h \to 0$, we see that
Euler’s method is consistent.
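
The first-order behavior of the truncation error can be observed directly whenever the exact solution is known. The Python sketch below does this for the problem $x' = 1 + x/t$, whose exact solution $x(t) = t(1 + \ln t)$ was given earlier. The evaluation point $t_i = 2$, the sequence of step sizes, and the function names are assumptions chosen for illustration.

```python
import math

def x_exact(t):
    # Exact solution of x' = 1 + x/t, x(1) = 1.
    return t * (1 + math.log(t))

def f(t, x):
    return 1 + x / t

def local_truncation_error(t, h):
    # Substitute the exact solution into Euler's difference equation.
    return (x_exact(t + h) - x_exact(t)) / h - f(t, x_exact(t))

t_i = 2.0
steps = [0.1 / 2**k for k in range(5)]
taus = [local_truncation_error(t_i, h) for h in steps]
ratios = [taus[k] / taus[k + 1] for k in range(4)]  # should be close to 2
```

Since $x''(t) = 1/t \le 1$ on $1 \le t \le 6$, each computed value should also satisfy $|\tau_i| \le h/2$.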

Next, consider the global discretization error associated with Euler’s method.
Recall that this measures how well the true solution of the differential equation has
been approximated. In particular, the global error at $t = t_i$ is given by $y_i - w_i$.
Note that this is a kind of accumulated error, since $w_i$ depends on the previous
approximations $w_{i-1}, w_{i-2}, w_{i-3}, \ldots, w_1$.


Theorem (Global Error for Euler’s method). Let $y(t)$ be the unique
solution to the initial value problem

\[
\begin{aligned}
y'(t) &= f(t, y(t)), \quad a \le t \le b \\
y(a) &= \alpha.
\end{aligned}
\]

Let $w_0, w_1, w_2, \ldots, w_N$ be generated by Euler’s method:

\[
\begin{aligned}
w_0 &= \alpha \\
w_{i+1} &= w_i + h f(t_i, w_i),
\end{aligned}
\]

where $h = (b - a)/N$ and $t_i = a + ih$. If $f$ satisfies a Lipschitz condition in $y$
on $D = \{(t, y) \mid a \le t \le b,\ y \in \mathbb{R}\}$ with constant $L$ and there exists a constant
$M$ such that

\[ \max_{t \in [a, b]} |y''(t)| \le M, \]

then

\[ |y_i - w_i| \le \frac{hM}{2L}\left[ e^{L(t_i - a)} - 1 \right]. \]

Remarks. (1) Note that $|y_i - w_i| = O(h)$, which confirms our earlier
numerical evidence.

(2) It is clear that $|y_i - w_i| \to 0$ as $h \to 0$, so Euler’s method is convergent.
(3) One weakness of this theorem is that we don’t know $M$, the bound on the
second derivative of the true solution. If $\partial f/\partial y$ and $\partial f/\partial t$ exist, then

\[
y'' = \frac{\partial f}{\partial t} + \frac{\partial f}{\partial y}\,y' = \frac{\partial f}{\partial t} + f\,\frac{\partial f}{\partial y},
\]

which can be used to obtain $M$.


The proof of this theorem requires two lemmas. We will state and prove both 
lemmas now and then present the proof of the theorem. 


Lemma 1. For all $x \ge -1$ and any positive $m$,

\[ 0 \le (1 + x)^m \le e^{mx}. \]


Proof. Since $m$ must be positive, it is sufficient to establish that $0 \le 1 + x \le e^x$.
Expanding $e^x$ in a Taylor series about $x_0 = 0$ with $n = 1$ gives

\[ e^x = 1 + x + \frac{1}{2} x^2 e^{\xi}, \]

where $\xi$ is between 0 and $x$. Then, for all $x \ge -1$, it follows that

\[ 0 \le 1 + x \le 1 + x + \frac{1}{2} x^2 e^{\xi} = e^x. \qquad \square \]


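Lemma 2. If $s$ and $t \in \mathbb{R}^+$ and $\{a_i\}$ is a sequence satisfying $a_0 \ge -t/s$ and

\[ a_{i+1} \le (1 + s)\,a_i + t, \]

then

\[ a_{i+1} \le e^{(i+1)s}\left( a_0 + \frac{t}{s} \right) - \frac{t}{s}. \]


Proof. For each fixed $i$,

\[
\begin{aligned}
a_{i+1} &\le (1 + s)\,a_i + t \\
&\le (1 + s)\left[ (1 + s)\,a_{i-1} + t \right] + t = (1 + s)^2 a_{i-1} + t\left[ 1 + (1 + s) \right] \\
&\le (1 + s)^2\left[ (1 + s)\,a_{i-2} + t \right] + t\left[ 1 + (1 + s) \right] \\
&= (1 + s)^3 a_{i-2} + t\left[ 1 + (1 + s) + (1 + s)^2 \right] \\
&\;\;\vdots \\
&\le (1 + s)^{i+1} a_0 + \left[ 1 + (1 + s) + (1 + s)^2 + \cdots + (1 + s)^i \right] t.
\end{aligned}
\]

The factor multiplying $t$ in the last line is a geometric series whose sum is
$[(1 + s)^{i+1} - 1]/s$. Therefore,

\[
a_{i+1} \le (1 + s)^{i+1} a_0 + \frac{t}{s}\left[ (1 + s)^{i+1} - 1 \right]
= (1 + s)^{i+1}\left( a_0 + \frac{t}{s} \right) - \frac{t}{s}.
\]

Since $a_0 + t/s \ge 0$, applying Lemma 1 with $x = s$ and $m = i + 1$ to the factor
$(1 + s)^{i+1}$ gives the stated bound. $\square$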

We are now in a position to prove the theorem on the global error for Euler’s 
method. 


Proof. The error formula is clearly true for $i = 0$ since $y_0 = w_0 = \alpha$. For
$i = 0, 1, 2, \ldots, N - 1$,

\[ y(t_{i+1}) = y(t_i) + h f(t_i, y_i) + \frac{h^2}{2}\,y''(\xi_i) \]

by Taylor's Theorem, and

\[ w_{i+1} = w_i + h f(t_i, w_i) \]

by the definition of Euler’s method. Upon subtracting these two equations,
we have

\[ y_{i+1} - w_{i+1} = y_i - w_i + h\left[ f(t_i, y_i) - f(t_i, w_i) \right] + \frac{h^2}{2}\,y''(\xi_i) \]

and

\[ |y_{i+1} - w_{i+1}| \le |y_i - w_i| + h\,|f(t_i, y_i) - f(t_i, w_i)| + \frac{h^2}{2}\,|y''(\xi_i)|. \]

Using the Lipschitz condition on $f$ and the bound on $y''$ gives

\[ |y_{i+1} - w_{i+1}| \le |y_i - w_i|\,(1 + hL) + \frac{h^2}{2} M. \]

Identifying $a_i = |y_i - w_i|$, $s = hL$ and $t = h^2 M/2$ in Lemma 2 gives

\[
|y_{i+1} - w_{i+1}| \le e^{(i+1)hL}\left( |y_0 - w_0| + \frac{hM}{2L} \right) - \frac{hM}{2L}
= \left( e^{(i+1)hL} - 1 \right)\frac{hM}{2L}.
\]

But $(i + 1)h = t_{i+1} - t_0 = t_{i+1} - a$, so

\[ |y_{i+1} - w_{i+1}| \le \frac{hM}{2L}\left[ e^{L(t_{i+1} - a)} - 1 \right]. \qquad \square \]

EXAMPLE 7.8 Confirming the Global Error Bound for Euler’s Method


Let’s once again consider the initial value problem

\[ \frac{dx}{dt} = 1 + \frac{x}{t}, \quad 1 \le t \le 6, \qquad x(1) = 1. \]

Since $f(t, x) = 1 + x/t$, it follows that

\[ |f(t, x_1) - f(t, x_2)| = \frac{|x_1 - x_2|}{t} \le |x_1 - x_2| \quad \text{for } 1 \le t \le 6. \]

Hence, $L = 1$. The exact solution of the initial value problem is $x(t) = t(1 + \ln t)$,
so $x''(t) = 1/t$ and

\[ M = \max_{1 \le t \le 6} |x''(t)| = 1. \]

With $h = 0.5$ and $a = 1$, the Global Error Theorem guarantees that

\[ |y_i - w_i| \le \frac{0.5 \cdot 1}{2 \cdot 1}\left[ e^{1 \cdot (t_i - 1)} - 1 \right] = \frac{1}{4}\left[ e^{t_i - 1} - 1 \right]. \]

The table below compares the actual error with this error bound for each $t_i$. Note
that for this problem, the error bound is significantly larger than the actual error.

t_i    |y_i - w_i|    Error Bound

1.5     0.108198       0.162180
2.0     0.219628       0.429570
2.5     0.332393       0.870422
3.0     0.445837       1.597264
3.5     0.559670       2.795623
4.0     0.673749       4.771384
4.5     0.787991       8.028863
5.0     0.902348      13.399538
5.5     1.016789      22.254283
6.0     1.131293      36.853290
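
The comparison in the table can be reproduced with a short script. The Python sketch below evaluates the bound $\frac{hM}{2L}\left[ e^{L(t_i - a)} - 1 \right]$ next to the actual Euler error for this problem; the variable and function names are assumptions of the sketch.

```python
import math

h, L, M, a = 0.5, 1.0, 1.0, 1.0  # values identified in the example

def error_bound(t_i):
    # Global error bound from the theorem.
    return h * M / (2 * L) * (math.exp(L * (t_i - a)) - 1)

def x_exact(t):
    return t * (1 + math.log(t))

# Euler's method for x' = 1 + x/t, x(1) = 1, with ten steps of size h.
w, rows = 1.0, []
for i in range(10):
    t_i = a + i * h
    w += h * (1 + w / t_i)
    t_next = a + (i + 1) * h
    rows.append((t_next, abs(x_exact(t_next) - w), error_bound(t_next)))

for t_next, actual, bound in rows:
    print(f"{t_next:4.1f}  {actual:10.6f}  {bound:10.6f}")
```

The bound grows exponentially in $t_i - a$ while the actual error here grows roughly linearly, which is why the gap widens so quickly down the table.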


The global error theorem for Euler’s method neglected the effect of roundoff
error in the calculation of the approximate solution. To examine this effect, let

\[
\begin{aligned}
w_0 &= \alpha \\
w_{i+1} &= w_i + h f(t_i, w_i)
\end{aligned}
\]

denote the Euler’s method approximation obtained using exact arithmetic, and let

\[
\begin{aligned}
u_0 &= \alpha + \delta_0 \\
u_{i+1} &= u_i + h f(t_i, u_i) + \delta_{i+1}
\end{aligned}
\]

denote the approximation obtained with finite precision arithmetic, where each $\delta_i$
represents the roundoff error introduced during the $i$th step of the solution process.
We then have the following result:


Theorem. Let $y(t)$ be the unique solution to the initial value problem

$$y'(t) = f(t, y(t)), \quad a \le t \le b, \qquad y(a) = \alpha,$$

and let $u_0, u_1, u_2, \ldots, u_N$ be generated by Euler’s method with finite precision arithmetic:

$$u_0 = \alpha + \delta_0, \qquad u_{i+1} = u_i + h f(t_i, u_i) + \delta_{i+1},$$

where $h = (b-a)/N$, $t_i = a + ih$ and $|\delta_i| < \delta$ for each $i = 0, 1, 2, \ldots, N$. If $f$ satisfies a Lipschitz condition in $y$ on $D = \{(t,y) \mid a \le t \le b,\ y \in R\}$ with constant $L$ and there exists a constant $M$ such that

$$\max_{t \in [a,b]} |y''(t)| \le M,$$

then

$$|y_i - u_i| \le \frac{1}{L}\left(\frac{hM}{2} + \frac{\delta}{h}\right)\left[e^{L(t_i - a)} - 1\right] + |\delta_0|\, e^{L(t_i - a)}.$$


Proof. The proof of this theorem follows the same steps as the proof of the previous theorem with all $w$’s replaced by $u$’s, except that

$$|y_0 - u_0| = |\delta_0|,$$

and, in the application of Lemma 2,

$$t = \frac{h^2 M}{2} + \delta. \qquad \Box$$

Note that when the effect of roundoff error is taken into account, the bound on the global error is composed of two competing terms. On one hand, there is the truncation error term, $hM/2$, which decreases linearly with $h$. On the other hand, there is the roundoff error term, $\delta/h$, which is inversely proportional to the step size. Thus, when $hM/2$ is much larger than $\delta/h$, truncation error will dominate, and we can expect to observe $O(h)$ convergence toward the exact solution. However, for sufficiently small values of the step size, roundoff error will dominate, and we can expect global error to grow with decreasing $h$. This situation is identical to the phenomenon we observed with numerical differentiation in Section 6.2.
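The trade-off between the two terms can be made explicit with a short calculation. As a sketch, treat the factor from the error bound as a function $E(h)$ of the step size; the names $E$ and $h^*$ are introduced here for illustration only, and the result describes where the bound is smallest, not necessarily where the actual error is smallest.

```latex
E(h) = \frac{hM}{2} + \frac{\delta}{h},
\qquad
E'(h) = \frac{M}{2} - \frac{\delta}{h^2} = 0
\quad\Longrightarrow\quad
h^* = \sqrt{\frac{2\delta}{M}},
\qquad
E(h^*) = \sqrt{2M\delta}.
```

For step sizes well above $h^*$ the truncation term dominates the bound, and for step sizes well below $h^*$ the roundoff term dominates.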

The final analysis issue that we will address is stability. Suppose we apply Euler’s method to the initial value problems

$$y' = f(t,y), \quad y(a) = \alpha \qquad \text{and} \qquad y' = f(t,y), \quad y(a) = \tilde{\alpha}.$$

Let $w_i$ denote the approximate solution associated with the initial condition $y(a) = \alpha$ and $\tilde{w}_i$ denote the approximate solution associated with the initial condition $y(a) = \tilde{\alpha}$. Provided $f$ satisfies a Lipschitz condition in $y$, a procedure similar to the one used to prove the global error theorem can be used here to establish that

$$|w_i - \tilde{w}_i| \le k(t_i)\,|\alpha - \tilde{\alpha}|, \qquad (3)$$

where $k(t_i) = e^{L(t_i - a)}$ and $L$ is the Lipschitz constant associated with $f$. The details of this derivation are left as an exercise. Since $k(t_i)$ is independent of $h$, it follows that Euler’s method is stable with respect to perturbations of the initial condition.
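Bound (3) is easy to check numerically. The sketch below uses the test problem $x' = 1 + x/t$ on $1 \le t \le 6$ (for which $L = 1$) with $N = 5$ and the two initial values 1 and 1.1; the function name `euler` and the list `checks` are illustrative assumptions.

```python
import math

def euler(f, a, alpha, h, n):
    """Euler approximations w_0, ..., w_n starting from w_0 = alpha."""
    w = [alpha]
    for i in range(n):
        t = a + i * h
        w.append(w[-1] + h * f(t, w[-1]))
    return w

f = lambda t, x: 1.0 + x / t      # Lipschitz constant L = 1 for 1 <= t <= 6
a, b, n, L = 1.0, 6.0, 5, 1.0
h = (b - a) / n

w = euler(f, a, 1.0, h, n)          # initial condition alpha = 1
w_tilde = euler(f, a, 1.1, h, n)    # perturbed initial condition

checks = []
for i in range(n + 1):
    t = a + i * h
    diff = abs(w[i] - w_tilde[i])
    bound = math.exp(L * (t - a)) * abs(1.0 - 1.1)   # k(t_i) * |alpha - alpha~|
    checks.append((diff, bound))
    print(f"t = {t:3.1f}  difference = {diff:.6f}  bound = {bound:.6f}")
```

For this linear problem the difference between the two approximations grows only linearly in $t_i$, while the bound grows exponentially, so the bound holds with a wide margin.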


An Application: Modeling the Spread of an Epidemic


Consider the following simple model, proposed by Kermack and McKendrick [1], 
for the spread of an epidemic. The population is divided into three categories: 
healthy individuals, infected individuals, and the dead. It is assumed that the 
epidemic spreads so rapidly that changes in the population due to birth, death by 
other causes and migration can be ignored. It is also assumed that the disease 
is transmitted to healthy individuals at a rate proportional to the product of the 
healthy and infected populations. 

Let H denote the number of healthy individuals, J the number of infected 
individuals, and D the number of dead as functions of time. Time, here, will be 
measured in weeks. The basic assumptions stated in the previous paragraph lead 
to the model 

$$\frac{dH}{dt} = -cHI, \qquad \frac{dI}{dt} = cHI - mI, \qquad \frac{dD}{dt} = mI,$$


where c is the transmission rate of the disease to healthy individuals and m is the 
mortality rate of infected individuals. 

The model can be reduced to a single equation as follows. Divide the H 
equation by the D equation to obtain 


$$\frac{dH}{dD} = -\frac{c}{m}H.$$

The solution of this equation is

$$H = H_0 \exp(-cD/m), \qquad (4)$$
where Ho is the initial number of healthy individuals. Next, sum the three original 
model equations to obtain 


$$\frac{d(H + I + D)}{dt} = 0 \quad \Rightarrow \quad H + I + D = N$$

for some (constant) total population $N$. Solving this last expression for $I$, we find

$$I = N - H - D. \qquad (5)$$

Finally, substitute (4) into (5) and then substitute the resulting expression for $I$ into the original model equation for $D$. This yields

$$\frac{dD}{dt} = m\left[N - D - H_0 \exp(-cD/m)\right]. \qquad (6)$$

Once the number of dead has been computed from (6), $H$ and $I$ can be obtained from the algebraic equations (4) and (5).

Suppose that in a remote village of 3000 people, 150 are initially infected with the disease and the other 2850 people are healthy. How many people will eventually die due to the disease? How long will it take for the disease to run its course? With $m = 1.8$ week$^{-1}$ and $c = 0.001$ (person-week)$^{-1}$, the time evolution of the three categories (healthy, infected and dead) is shown in Figure 7.1. The initial value problem for $D$ was solved using Euler’s method with a step size of $h = 0.1$ weeks. Equilibrium seems to be reached in roughly eight weeks, with 2124 dead at that time.

Figure 7.1 Time evolution of the number of healthy, the number of infected and the number of dead due to the spread of an epidemic.
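The computation described above can be sketched as follows. This is a minimal illustration, not the author's program: it assumes $D(0) = 0$ (no deaths at the start, consistent with 150 infected plus 2850 healthy accounting for all 3000 villagers), and the names `dD_dt` and `simulate` are hypothetical.

```python
import math

N, H0 = 3000.0, 2850.0     # total population and initial number of healthy
m, c = 1.8, 0.001          # mortality and transmission rates from the text
h = 0.1                    # step size in weeks

def dD_dt(D):
    """Right-hand side of equation (6)."""
    return m * (N - D - H0 * math.exp(-c * D / m))

def simulate(weeks):
    """Euler's method for D; H and I follow from equations (4) and (5)."""
    steps = round(weeks / h)
    D = 0.0                                   # assumed initial condition D(0) = 0
    history = [(0.0, H0, N - H0, D)]
    for i in range(steps):
        D += h * dD_dt(D)                     # Euler update for the dead
        H = H0 * math.exp(-c * D / m)         # equation (4)
        I = N - H - D                         # equation (5)
        history.append(((i + 1) * h, H, I, D))
    return history

for t, H, I, D in simulate(8.0)[::10]:
    print(f"t = {t:4.1f}  H = {H:7.1f}  I = {I:7.1f}  D = {D:7.1f}")
```

Printing every tenth step gives one line per week, which is enough to see the infected population rise and then die out as the number of dead levels off.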
EXERCISES

Chapter 7 Initial Value Problems of Ordinary Differential Equations

For Exercises 1-6, apply Euler’s method to approximate the solution of the given initial value problem over the indicated interval in t using the indicated number of time steps.


1. 


#=tx®-2 (0<St<1), x0}=1, N=4 


2.0 +(4/t)a=t* (1<t<3), x(l)=1, N=5 
3. 

4.e¢°=(1+2*)/ft (1<t<4), 2(1)=0, N=5 
5, 
6. 


z= (sine -e')/eose (0<¢<1), 2(0)=0, N=3 


a= ~227-1 (0<t<1), 2(0)=0, N=4 
c= %(1-az)/(sinz) (1<t<2), 2(l)=2, N=3 


For Exercises 7-10, apply Euler’s method to approximate the solution of the given initial value problem over the indicated interval in t using the indicated number of time steps. Compare the approximate solution with the given exact solution, and compare the actual error with the theoretical error bound. When determining the Lipschitz constant, consider the indicated set D.


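7. $x' = e^t/x$ $(0 \le t \le 2)$, $x(0) = 1$, $N = 4$, $x(t) = \sqrt{2e^t - 1}$,
   $D = \{(t,x) \mid 0 \le t \le 2,\ x \ge 1\}$

8. $x' = -t\tan x/(1 + t^2)$ $(0 \le t \le 1)$, $x(0) = \pi/4$, $N = 4$, $x(t) = \sin^{-1}\left(1/\sqrt{2 + 2t^2}\right)$,
   $D = \{(t,x) \mid 0 \le t \le 1,\ 0 \le x \le \pi/4\}$

9. $x' = t - x$ $(0 \le t \le 4)$, $x(0) = 1$, $N = 4$, $x(t) = 2e^{-t} + t - 1$,
   $D = \{(t,x) \mid 0 \le t \le 4,\ x \in R\}$

10. $x' = x - t$ $(0 \le t \le 2)$, $x(0) = 2$, $N = 4$, $x(t) = e^t + t + 1$,
    $D = \{(t,x) \mid 0 \le t \le 4,\ x \in R\}$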


Two different initial conditions are specified for the initial value problems in Exercises 
11-14. Compare the approximate solutions corresponding to each initial condition, and 
verify that the bound specified in equation (3) holds. 


1, 
12. 
13. 
14. 
15. 
16. 


17. 


a! = —2tar/(l+t?) (2<t<4), N=5, 2(2) = —5 versus 2(2) = —5.1 
g=sint—-z (0<¢<2), N=5, 2(0)=1 versus 2(0) = 1.1 
zi =%e'-2 (0<t<5), N=5, 2(0) =1 versus 2(0) =0.9 
ai=l+a2/t (1<t<6), N=5, 2(1) =1 versus 2(1) = 1.1 
Derive equation (3).
Suppose Euler’s method has been used to obtain approximate values for the solution to the initial value problem $y' = f(t,y)$, $y(a) = \alpha$ at $t = t_1$ and $t = t_2$. Would it be appropriate to approximate the solution for $t$ between $t_1$ and $t_2$ using linear interpolation? Explain.
Suppose we use Euler’s method to approximate the solution of the initial value problem $y' = f(t,y)$, $y(a) = \alpha$ over the interval $a \le t \le b$. Let $w_h(b)$ denote the approximation to $y(b)$ obtained with a step size of $h$. Since the global error associated with Euler’s method is $O(h)$, toward what value do we expect the expression

$$\frac{w_h(b) - w_{h/2}(b)}{w_{h/2}(b) - w_{h/4}(b)}$$

to converge as $h$ is reduced?
In Exercises 18-22, confirm that the global error associated with Euler’s method is O(h).


18. 
19. 
20. 
21. 
22. 
23, 


24, 


25. 


26, 


27. 


v =4tVe? 4+ 1/e (0<t<5), (0) =1, a(t) = (20? + V2)? -1 

te! 4 =the’ (1<t<2), 2(1)=0, x(t) =t4*(e’ -e) 

a =l-ate*z*? (0<t<09), 2(0)=0, x(t) =e 'tan(et—1) 

a =2t(x+1)/xe (0<t<2), 2x(0) = -2 

w=(t—tet+a7)\/(tz) (1<t<2), 2(1)=2 

(a) Consider the initial value problem

$$\frac{dx}{dt} = 2 - \frac{x}{t} \quad (1 \le t \le 6), \qquad x(1) = 2.$$

The exact solution of this problem is $x(t) = t + 1/t$. With what rate does Euler’s method converge to this exact solution?

(b) Repeat part (a), but change the initial condition to $x(1) = 1$. The exact solution in this case is $x(t) = t$.

(c) Explain any difference in rate of convergence between parts (a) and (b).
(a) Consider the initial value problem

$$\frac{dx}{dt} = -(1 + t + t^2) - (2t + 1)x - x^2 \quad (0 \le t \le 3), \qquad x(0) = -\frac{1}{2}.$$

The exact solution of this problem is $x(t) = -t - 1/(e^t + 1)$. With what rate does Euler’s method converge to this exact solution?

(b) Repeat part (a), but change the initial condition to $x(0) = -1$. The exact solution in this case is $x(t) = -t - 1$.

(c) Explain any difference in rate of convergence between parts (a) and (b).

In the “Modeling the Spread of an Epidemic” problem, vary the parameters $c$ and $m$. What effect does each parameter apparently have upon the dynamics of the epidemic?


The liquid level in an accumulator for a pumped-fluid system satisfies the model 


dh 10.3 

ae 60) yee 7 esi) 

i + 0,002 (52 Ih+ ap) 1.17(1 + sin 3t) = 0.0308 
h(0) = 5.0, 


where $h$ denotes the liquid level (measured in meters) and $t$ denotes time (measured in minutes). How does the liquid level evolve in time? Estimate the amplitude and period of the oscillation in the liquid level.


Consider the population model

$$\frac{dx}{dt} = rx\left(1 - \frac{x}{k}\right) - \frac{x^2}{1 + x^2}.$$

The first term on the right-hand side is known as the logistic growth term. This 
term results from the assumption that for small population levels, the population 
will grow at a rate proportional to the current level, while, for large population 
levels, limited resources will cause the growth rate to decrease and eventually
become negative. The parameters r and k are called the natural growth rate 
of the population and the environmental carrying capacity, respectively. The 
second term on the right-hand side represents harvesting/ predation of the species 
by some other species (e.g., fish being caught by fishermen or insects being eaten 
by birds). 

(a) For r = 0.4 and k = 20, use Euler’s method to determine the eventual 

population level reached from an initial population of 2.44. 
(b) Repeat part (a), but with the initial population changed to 2.40. 


7.3. HIGHER-ORDER ONE-STEP METHODS: TAYLOR METHODS 


Although Euler’s method is straightforward both to develop and to implement, it is not very accurate. In particular, both the local truncation error, $\tau_i$, and the global discretization error, $|y_i - w_i|$, are only $O(h)$. In this section, we will develop several higher-order one-step methods for scalar first-order initial value problems.


Taylor Methods 


The most natural approach to develop a more accurate scheme than Euler’s method is to follow the derivation of Euler’s method, but to assume the true solution has more continuous derivatives and, accordingly, to retain more terms in the Taylor expansion. In particular, assume $y$ has $n + 1$ continuous derivatives on $[a,b]$ and expand in a Taylor series about $t = t_i$:

$$y(t) = y_i + (t - t_i)\,y_i' + \frac{(t - t_i)^2}{2}\,y_i'' + \cdots + \frac{(t - t_i)^n}{n!}\,y_i^{(n)} + \frac{(t - t_i)^{n+1}}{(n+1)!}\,y^{(n+1)}(\xi). \qquad (1)$$


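From the differential equation, we know that we can replace $y_i'$ by $f(t_i, y_i)$. Each higher-order derivative of $y$ that appears in (1) can be replaced by the appropriate order total derivative $(d/dt)$ of $f$. For instance,

$$y_i'' = \left.\frac{df}{dt}(t,y)\right|_{t=t_i} = \left.\left(\frac{\partial f}{\partial t} + \frac{\partial f}{\partial y} f\right)\right|_{t=t_i},$$

$$y_i''' = \left.\frac{d^2 f}{dt^2}(t,y)\right|_{t=t_i} = \left.\left(\frac{\partial^2 f}{\partial t^2} + 2\frac{\partial^2 f}{\partial t\,\partial y} f + \frac{\partial^2 f}{\partial y^2} f^2 + \frac{\partial f}{\partial t}\frac{\partial f}{\partial y} + \left(\frac{\partial f}{\partial y}\right)^2 f\right)\right|_{t=t_i}, \qquad (2)$$

etc.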


Evaluating the resulting expression at $t = t_{i+1}$, dropping the remainder term and replacing all $y$’s by $w$’s produces the one-step method

$$\frac{w_{i+1} - w_i}{h} = f(t_i, w_i) + \frac{h}{2}\left.\frac{df}{dt}(t,y)\right|_{(t_i, w_i)} + \cdots + \frac{h^{n-1}}{n!}\left.\frac{d^{n-1} f}{dt^{n-1}}(t,y)\right|_{(t_i, w_i)}, \qquad (3)$$

where $h = t_{i+1} - t_i$. Numerical methods derived in this manner are called Taylor methods. Note that Euler’s method is the Taylor method corresponding to $n = 1$.
The local truncation error associated with (3) is clearly $O(h^n)$. Equation (3)
therefore represents the general nth-order Taylor method. Below, and in the ex- 
ercises, we will undertake a numerical investigation of the rate of convergence of 
the global error of Taylor methods. A theoretical analysis of the global error and 
stability of Taylor methods will be deferred until Section 7.6. 

While retaining more terms in the Taylor series provides a more accurate 
representation of the solution, it also produces numerical methods which require 
more work per time step. With initial value problems, as with rootfinding problems 
and numerical integration, the standard measure of work is the number of function 
evaluations. For instance, the second-order Taylor method,

$$w_{i+1} = w_i + h f(t_i, w_i) + \frac{h^2}{2}\frac{df}{dt}(t_i, w_i),$$

requires two function evaluations per time step, while the fourth-order Taylor method,

$$w_{i+1} = w_i + h f(t_i, w_i) + \frac{h^2}{2}\frac{df}{dt}(t_i, w_i) + \frac{h^3}{6}\frac{d^2 f}{dt^2}(t_i, w_i) + \frac{h^4}{24}\frac{d^3 f}{dt^3}(t_i, w_i),$$

requires four function evaluations. Compare these counts with the one function evaluation per time step used by Euler’s method.

This brings up a very important question. Is the extra expense of a higher- 
order Taylor method justified in terms of performance? If several methods of dif- 
ferent order are applied to the same initial value problem and the same step size is 
used for each method, then, of course, the higher-order methods will produce more 
accurate approximations. But what happens if we vary the step size from method 
to method in such a way that each method uses the same total number of func- 
tion evaluations? Will the higher-order methods still outperform the lower-order 
methods? Let’s find out. 


EXAMPLE 7.9 Euler’s Method versus the Second- and Fourth-Order 
Taylor Methods 


Consider the initial value problem

$$\frac{dx}{dt} = 1 + \frac{x}{t} \quad (1 \le t \le 6), \qquad x(1) = 1,$$

whose exact solution is $x(t) = t(1 + \ln t)$. Figure 7.2 displays the base 10 logarithm of the absolute error in the approximate solutions obtained from Euler’s method, the second-order Taylor method, and the fourth-order Taylor method. Step sizes of $h = 0.125$, $h = 0.25$, and $h = 0.5$ were used for the three methods, respectively. These choices allowed each method the same number of function evaluations (40) to advance from $t = 1$ to $t = 6$. Clearly, even with larger step sizes, higher-order Taylor methods outperform their lower-order counterparts.
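The equal-work comparison can be sketched in code. The derivative formulas for this particular $f$ are the ones worked out by hand in Examples 7.10 and 7.11; the function names `derivs` and `taylor`, and the choice to report only the error at $t = 6$ (rather than the whole error curve of Figure 7.2), are illustrative assumptions.

```python
import math

def derivs(t, x):
    """f and its first three total derivatives for x' = 1 + x/t."""
    f0 = 1.0 + x / t
    f1 = (t * f0 - x) / t**2
    f2 = f1 / t - 2.0 * f0 / t**2 + 2.0 * x / t**3
    f3 = f2 / t - 3.0 * f1 / t**2 + 6.0 * f0 / t**3 - 6.0 * x / t**4
    return f0, f1, f2, f3

def taylor(order, h, t0=1.0, w0=1.0, t_end=6.0):
    """Approximate x(t_end) with the Taylor method of the given order (1 to 4)."""
    n = round((t_end - t0) / h)
    w = w0
    for i in range(n):
        d = derivs(t0 + i * h, w)
        # Sum the first `order` terms h^(k+1)/(k+1)! * d^k f/dt^k.
        w += sum(h**(k + 1) / math.factorial(k + 1) * d[k] for k in range(order))
    return w

exact = 6.0 * (1.0 + math.log(6.0))
# Each pairing uses 40 function evaluations between t = 1 and t = 6.
errors = {order: abs(exact - taylor(order, h))
          for order, h in [(1, 0.125), (2, 0.25), (4, 0.5)]}
for order, err in errors.items():
    print(f"order {order}: error at t = 6 is {err:.6e}")
```

Order 1 here is Euler’s method, so the three printed errors correspond to the right-hand ends of the three curves in the figure.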
Figure 7.2 Logarithm of error in approximate solution to the initial value problem $x' = 1 + x/t$, $x(1) = 1$ computed using Euler’s method (with $h = 0.125$), the second-order Taylor method (with $h = 0.25$), and the fourth-order Taylor method (with $h = 0.5$).


Now, let’s take a closer look at the calculations involved in using higher- 
order Taylor methods. In the next two examples, pay particular attention to the 
derivatives of the right-hand side function f. 


EXAMPLE 7.10 Second-Order Taylor Method in Action 


The second-order Taylor method requires the evaluation of both the right-hand side function and its first derivative during each time step. For the initial value problem

$$\frac{dx}{dt} = 1 + \frac{x}{t} \quad (1 \le t \le 6), \qquad x(1) = 1,$$

$f(t,x) = 1 + x/t$, so, applying the quotient rule,

$$\frac{df}{dt}(t,x) = \frac{t x' - x}{t^2} = \frac{t f - x}{t^2}.$$

Take $h = 0.25$, the same value used in the previous example. With $t_0 = 1$ and $w_0 = 1$, it follows that

$$f(t_0, w_0) = 1 + \frac{w_0}{t_0} = 1 + \frac{1}{1} = 2; \quad \text{and}$$

$$\frac{df}{dt}(t_0, w_0) = \frac{t_0 f - w_0}{t_0^2} = \frac{1 \cdot 2 - 1}{1^2} = 1.$$

Therefore,

$$w_1 = w_0 + h f(t_0, w_0) + \frac{h^2}{2}\frac{df}{dt}(t_0, w_0) = 1 + (0.25)(2) + \frac{0.25^2}{2}(1) = 1.53125.$$

Hence, $y(1.25) \approx 1.53125$, an approximation that is in error by 0.00232. In the next time step, we calculate, with $t_1 = t_0 + h = 1.25$,

$$f(t_1, w_1) = 1 + \frac{w_1}{t_1} = 1 + \frac{1.53125}{1.25} = 2.225; \quad \text{and}$$

$$\frac{df}{dt}(t_1, w_1) = \frac{t_1 f - w_1}{t_1^2} = \frac{(1.25)(2.225) - 1.53125}{1.25^2} = 0.8.$$

Finally,

$$w_2 = w_1 + h f(t_1, w_1) + \frac{h^2}{2}\frac{df}{dt}(t_1, w_1) = 1.53125 + (0.25)(2.225) + \frac{0.25^2}{2}(0.8) = 2.1125.$$

This approximation to $y(1.5)$ is in error by roughly 0.00430.
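The two hand-computed steps can be checked with a short script. This is a minimal sketch for this one test problem; the function names `f`, `df_dt` and `taylor2_step` are illustrative, and `df_dt` hard-codes the quotient-rule result derived above.

```python
def f(t, x):
    """Right-hand side of x' = 1 + x/t."""
    return 1.0 + x / t

def df_dt(t, x):
    """Total derivative of f along the solution: (t f - x) / t^2."""
    return (t * f(t, x) - x) / t**2

def taylor2_step(t, w, h):
    """One step of the second-order Taylor method."""
    return w + h * f(t, w) + 0.5 * h**2 * df_dt(t, w)

h = 0.25
t0, w0 = 1.0, 1.0
w1 = taylor2_step(t0, w0, h)        # approximation at t = 1.25
w2 = taylor2_step(t0 + h, w1, h)    # approximation at t = 1.5
print(w1, w2)
```

Calling `taylor2_step` repeatedly in a loop carries the approximation the rest of the way to $t = 6$.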


EXAMPLE 7.11 Fourth-Order Taylor Method in Action 


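The fourth-order Taylor method requires the evaluation of $f$ and its first three derivatives each time step. For our standard test problem, $f(t,x) = 1 + x/t$, and we have already calculated

$$\frac{df}{dt} = \frac{t f - x}{t^2}.$$

To obtain the second derivative of $f$, let’s first rewrite the first derivative as $f/t - x/t^2$. Using the quotient rule twice yields

$$\frac{d^2 f}{dt^2} = \frac{t f' - f}{t^2} - \frac{t^2 f - 2tx}{t^4} = \frac{f'}{t} - \frac{2f}{t^2} + \frac{2x}{t^3}.$$

The third derivative of $f$ is then given by

$$\frac{d^3 f}{dt^3} = \frac{t f'' - f'}{t^2} - 2\,\frac{t^2 f' - 2tf}{t^4} + 2\,\frac{t^3 f - 3t^2 x}{t^6} = \frac{f''}{t} - \frac{3f'}{t^2} + \frac{6f}{t^3} - \frac{6x}{t^4}.$$

With $h = 0.5$ and $t_0 = w_0 = 1$, we calculate

$$f(t_0, w_0) = 2; \qquad \frac{df}{dt}(t_0, w_0) = 1;$$

$$\frac{d^2 f}{dt^2}(t_0, w_0) = \frac{1}{1} - \frac{2(2)}{1^2} + \frac{2(1)}{1^3} = -1; \quad \text{and}$$

$$\frac{d^3 f}{dt^3}(t_0, w_0) = \frac{-1}{1} - \frac{3(1)}{1^2} + \frac{6(2)}{1^3} - \frac{6(1)}{1^4} = 2.$$

Therefore,

$$w_1 = 1 + (0.5)(2) + \frac{0.5^2}{2}(1) + \frac{0.5^3}{6}(-1) + \frac{0.5^4}{24}(2) = 2.109375.$$

This approximation to $y(1.5)$ is in error by only 0.00118, which is roughly one-fourth the error produced by the second-order Taylor method. Calculations for the next time step produce

$$f(t_1, w_1) = 2.40625, \qquad \frac{df}{dt}(t_1, w_1) = 0.666667,$$

$$\frac{d^2 f}{dt^2}(t_1, w_1) = -0.444444, \qquad \frac{d^3 f}{dt^3}(t_1, w_1) = 0.592593$$

and $w_2 = 3.388117$.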
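The fourth-order calculation can be verified in the same way. The sketch below hard-codes the three derivative formulas obtained above for $f(t,x) = 1 + x/t$; the names `derivs` and `taylor4_step` are illustrative assumptions.

```python
def derivs(t, x):
    """f, df/dt, d2f/dt2 and d3f/dt3 for x' = 1 + x/t."""
    f0 = 1.0 + x / t
    f1 = (t * f0 - x) / t**2
    f2 = f1 / t - 2.0 * f0 / t**2 + 2.0 * x / t**3
    f3 = f2 / t - 3.0 * f1 / t**2 + 6.0 * f0 / t**3 - 6.0 * x / t**4
    return f0, f1, f2, f3

def taylor4_step(t, w, h):
    """One step of the fourth-order Taylor method."""
    f0, f1, f2, f3 = derivs(t, w)
    return w + h * f0 + h**2 / 2.0 * f1 + h**3 / 6.0 * f2 + h**4 / 24.0 * f3

h = 0.5
w1 = taylor4_step(1.0, 1.0, h)    # approximation at t = 1.5
w2 = taylor4_step(1.5, w1, h)     # approximation at t = 2.0
print(derivs(1.0, 1.0))           # derivative values at (t0, w0)
print(w1, w2)
```

Writing the update as a sum of four separately computed terms mirrors the hand calculation, which makes it easy to compare intermediate values term by term.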


In the previous section, we found that the global error associated with Euler’s method was, like its local truncation error, $O(h)$. Is the same thing true for higher-order Taylor methods? For example, the local truncation error of the second-order Taylor method is $O(h^2)$. Is the global error for this method also $O(h^2)$? Is the global error of the fourth-order Taylor method $O(h^4)$? As noted earlier, we will undertake a numerical investigation of global error here and defer a theoretical analysis to a later section.


EXAMPLE 7.12 Rate of Convergence of Higher-Order Taylor Methods 


The table below lists the absolute error, as a function of step size, in the approximate value for $x(6)$, where $x(t) = t(1 + \ln t)$ is the solution of the initial value problem

$$\frac{dx}{dt} = 1 + \frac{x}{t} \quad (1 \le t \le 6), \qquad x(1) = 1.$$

Approximate solutions were computed using both the second-order and the fourth-order Taylor methods. Observe the values in the error ratio columns. With each decrease by a factor of two in the step size, the error in the approximation obtained from the second-order Taylor method drops by roughly a factor of 4, and the error in the approximation obtained from the fourth-order Taylor method drops by roughly a factor of 16. Hence, our numerical evidence suggests that the global error for the second-order method is $O(h^2)$, while the global error for the fourth-order method is $O(h^4)$. Further evidence will be gathered in the exercises.

         Second-Order Taylor           Fourth-Order Taylor
h        Error          Error Ratio    Error          Error Ratio
1/2      1.187073e-01                  5.800613e-03
1/4      3.019225e-02   3.9317         3.354233e-04   17.2934
1/8      7.583378e-03   3.9814         1.973828e-05   16.9935
1/16     1.898111e-03   3.9952         1.190011e-06   16.5866
1/32     4.746703e-04   3.9988         7.294264e-08   16.3143
1/64     1.186765e-04   3.9997         4.513211e-09   16.1620
1/128    2.966968e-05   3.9999         2.806431e-10   16.0817
1/256    7.417455e-06   4.0000         1.745804e-11   16.0753
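A convergence study of this kind takes only a few lines. The following sketch halves the step size repeatedly and forms successive error ratios; it stops at $h = 1/32$ to keep the run short, and the names `derivs`, `taylor_x6` and `ratios` are illustrative assumptions.

```python
import math

def derivs(t, x):
    """f and its first three total derivatives for x' = 1 + x/t."""
    f0 = 1.0 + x / t
    f1 = (t * f0 - x) / t**2
    f2 = f1 / t - 2.0 * f0 / t**2 + 2.0 * x / t**3
    f3 = f2 / t - 3.0 * f1 / t**2 + 6.0 * f0 / t**3 - 6.0 * x / t**4
    return f0, f1, f2, f3

def taylor_x6(order, h):
    """Approximate x(6) from x(1) = 1 with a Taylor method of the given order."""
    w = 1.0
    for i in range(round(5.0 / h)):
        d = derivs(1.0 + i * h, w)
        w += sum(h**(k + 1) / math.factorial(k + 1) * d[k] for k in range(order))
    return w

exact = 6.0 * (1.0 + math.log(6.0))
ratios = {}
for order in (2, 4):
    errs = [abs(exact - taylor_x6(order, 0.5**k)) for k in range(1, 6)]
    # Ratio of the error at step size 2h to the error at step size h.
    ratios[order] = [errs[k] / errs[k + 1] for k in range(len(errs) - 1)]
    print(order, [f"{r:.4f}" for r in ratios[order]])
```

A ratio near $2^p$ after each halving of $h$ is the numerical signature of $O(h^p)$ global error.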


Although Taylor methods are straightforward to derive, for n > 1 they all 
suffer the practical disadvantage of requiring the user to compute and supply the 
appropriate number of derivatives of the function f. As illustrated by equation (2) 
and the worked examples involving the second- and fourth-order methods, these 
computations become very cumbersome, very quickly. For this reason one would 
prefer to develop higher-order methods which involve evaluations of f only. We will 
derive such methods in the next section. 


An Application: Radiative Heat Transfer to a Thin Metal Plate 


This problem was adapted from Cutlip and Shacham [1]. A metal plate is to be heat 
treated by placing it into a high-temperature furnace. The plate will be suspended 
inside the furnace so that both sides may be rapidly heated by radiation from the 
furnace walls. We will assume that the interior of the furnace and the surfaces of 
the plate radiate as black bodies. We will also assume that the plate is sufficiently 
thin and the metal of sufficiently high thermal conductivity so that temperature 
variations across the thickness of the plate can be ignored. 

To model the time variation of the temperature of the plate, we perform a basic energy balance. This balance will consist of two components: the rate of change of the thermal energy stored in the plate and the rate at which radiation energy is absorbed. The total thermal energy stored within the plate, $E_s$, is given by

$$E_s = \rho V c_p T,$$

where $\rho$ is the mass density, $V$ the volume, $c_p$ the specific heat, and $T$ the temperature of the plate. Treating the density and volume as constant, but taking into account the temperature dependence of the specific heat, it follows that

$$\frac{dE_s}{dt} = \frac{d}{dt}\left(\rho V c_p T\right) = \rho V\left(c_p \frac{dT}{dt} + T\frac{dc_p}{dT}\frac{dT}{dt}\right) = \rho V\left(c_p + T\frac{dc_p}{dT}\right)\frac{dT}{dt}.$$

The rate at which radiation energy is absorbed by the plate, $q_{rad}$, is governed by the Stefan-Boltzmann law

$$q_{rad} = \sigma A\left(T_F^4 - T^4\right),$$

where $\sigma = 5.676 \times 10^{-8}$ W/m$^2 \cdot$K$^4$ is the Stefan-Boltzmann constant, $A$ is the surface area of both sides of the plate, and $T_F$ is the temperature of the furnace. Equating $dE_s/dt$ with $q_{rad}$, solving for $dT/dt$, and using the fact that $A/V = 2/d$, where $d$ is the thickness of the plate, yields

$$\frac{dT}{dt} = \frac{2\sigma}{\rho d}\,\frac{T_F^4 - T^4}{c_p + T\,\dfrac{dc_p}{dT}}. \qquad (4)$$

Suppose a copper plate of thickness $d = 0.002$ m, mass density $\rho = 8933$ kg/m$^3$, and initial temperature $T(0) = 300$ K is suspended in a furnace of temperature $T_F = 1200$ K. Standard thermodynamic tables (see, for example, Incropera and DeWitt [2]) list the following values for the specific heat of copper:


Temperature (K) 300 400 600 800 1000 1200 
Specific Heat (J/kg-K) 385 397 417 433 451 480 


We will model this data using the least-squares regression line

$$c_p(T) = 355.2 + 0.1004\,T.$$

Substituting the given values into (4) leads to

$$\frac{dT}{dt} = 6.354 \times 10^{-9}\,\frac{T_F^4 - T^4}{355.2 + 0.2008\,T}$$

together with the initial condition $T(0) = 300$. To improve the scaling of the problem, introduce the nondimensional variable $\theta = T/T_F$. The initial value problem for $\theta$ becomes

$$\frac{d\theta}{dt} = 10.980\,\frac{1 - \theta^4}{355.2 + 240.96\,\theta}, \qquad \theta(0) = 0.25.$$

Let’s use the second-order Taylor method to approximate the solution of this initial value problem. For this problem

$$f(t,\theta) = 10.980\,\frac{1 - \theta^4}{355.2 + 240.96\,\theta},$$

from which we calculate

$$\frac{df}{dt} = -10.980\,\frac{(355.2 + 240.96\,\theta)(4\theta^3 f) + 240.96\,(1 - \theta^4) f}{(355.2 + 240.96\,\theta)^2}.$$


With a step size of $h = 0.25$ seconds, time was advanced from $t = 0$ to $t = 100$ seconds, and the results are displayed in Figure 7.3. Note that it takes roughly 90 seconds for the temperature of the plate to reach equilibrium.
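The simulation just described can be sketched as follows. The constants are the rounded values quoted in the text; the names `K`, `C0`, `C1`, `f`, `df_dt` and `history` are illustrative assumptions, and the script tabulates the temperature rather than plotting it.

```python
K, C0, C1 = 10.980, 355.2, 240.96   # coefficients of the nondimensional model
T_F = 1200.0                        # furnace temperature in kelvin

def f(theta):
    """Right-hand side of the initial value problem for theta = T / T_F."""
    return K * (1.0 - theta**4) / (C0 + C1 * theta)

def df_dt(theta):
    """Total time derivative of f along the solution (chain rule)."""
    g = C0 + C1 * theta
    fv = f(theta)
    return -K * (g * 4.0 * theta**3 * fv + C1 * (1.0 - theta**4) * fv) / g**2

h, t_end = 0.25, 100.0
theta = 0.25                        # theta(0) = 300 / 1200
history = [(0.0, theta)]
for i in range(round(t_end / h)):
    # Second-order Taylor update.
    theta += h * f(theta) + 0.5 * h**2 * df_dt(theta)
    history.append(((i + 1) * h, theta))

for t, th in history[::40]:         # one line every 10 seconds
    print(f"t = {t:5.1f} s   T = {T_F * th:7.1f} K")
```

Because $f$ does not depend explicitly on $t$, the derivative routine needs only $\theta$ as input.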
Figure 7.3 Temperature of thin copper plate suspended in a high-temperature furnace.


1. For each of the following differential equations, identify the function $f(t,x)$ and calculate $df/dt$.


(a) 2 =e'/a (b) a! + ta = tz? 
(c) a! = te tt 1 (d) oc! =e + (14 Se')n+ 2? 


2. For each of the following differential equations, identify the function $f(t,x)$ and calculate $df/dt$, $d^2f/dt^2$ and $d^3f/dt^3$.
(a) 2 +2¢7 =1?-1 (b) xz’ =sint-z 
(c) 2 + #=74 (d) zc’ =2-t 


3. Apply the second-order Taylor method to approximate the solution of the given initial value problem over the indicated interval in t using the indicated number of time steps.


(a) v se/e (0<t<1), 2(0)=1, N=4 

(b) 2’ +t2=tz? (0<t<2), 2(0)=1/2, N=4 

(c) 2 =te™*'-1 (0<t<2), 2(0)=-1, N=3 

(a) co =e*4 (14+ $e)ot+2? (O<St<1), 2(0)=-1, N=2 


4. Apply the third-order Taylor method to approximate the solution of the given initial value problem over the indicated interval in t using the indicated number of time steps.


(a) o' +22? =2?-1 (0<¢<1), 2(0)=0, N=2 
(b) & =sint—z (m<t<2n), a(m)=1, N=2 
(ce) oe += (1<t<2), 2(l)=1, N=4 
(d) $x' = x - t$ $(0 \le t \le 1)$, $x(0) = 2$, $N = 4$


5. Repeat Exercise 4 with the fourth-order Taylor method.
6. Suppose we approximate the solution of the initial value problem $y' = f(t,y)$, $y(a) = \alpha$ over the interval $a \le t \le b$ with some numerical method. Let $w_h(b)$ denote the approximation to $y(b)$ obtained with a step size of $h$. If the global error associated with the method is $O(h^k)$, toward what value do we expect the expression

$$\frac{w_h(b) - w_{h/2}(b)}{w_{h/2}(b) - w_{h/4}(b)}$$

to converge as $h$ is reduced?


7. Use each of the following initial value problems to demonstrate that the global error associated with the second-order Taylor method is $O(h^2)$.

(a) $x' = e^t/x$ $(0 \le t \le 1)$, $x(0) = 1$, $x(t) = \sqrt{2e^t - 1}$

(b) $x' + tx = tx^2$ $(0 \le t \le 2)$, $x(0) = 1/2$, $x(t) = \left(1 + e^{t^2/2}\right)^{-1}$

(c) $x' = te^{t+x} - 1$ $(0 \le t \le 2)$, $x(0) = -1$, $x(t) = -t - \ln\left(e - t^2/2\right)$
(d) a =e*4+(1+3e)a+2” (O<t< 1), 2(0)=-1 


8. Use each of the following initial value problems to demonstrate that the global error associated with the third-order Taylor method is $O(h^3)$.

(a) o' +207 =#2-1 (0<¢<1), 2(0)=0 

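(b) $x' = \sin t - x$ $(\pi \le t \le 2\pi)$, $x(\pi) = 1$, $x(t) = \frac{1}{2}e^{\pi - t} + \frac{1}{2}\sin t - \frac{1}{2}\cos t$
(c) $x' + \frac{4}{t}x = t^4$ $(1 \le t \le 2)$, $x(1) = 1$, $x(t) = \frac{1}{9}t^5 + \frac{8}{9}t^{-4}$

(d) $x' = x - t$ $(0 \le t \le 1)$, $x(0) = 2$, $x(t) = e^t + t + 1$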


9. Repeat Exercise 8 with the fourth-order Taylor method.
10. 


Compare the performance of Euler’s method, the second-order Taylor method 
and the fourth-order Taylor method on the initial value problem 


=— <t<s5 0) =1. 
Tap OStS8), 200) 


The exact solution for this problem is z(t) = V## +1. 
Repeat Exercise 10 using the initial value problem 


whose exact solution is z(t) = —1/(t1n(2t)). 
(a) Consider the initial value problem 


ae 9-2 <t<s), 2(1)=2 


The exact solution of this problem is x(t) = t? +1/t. With what rate does 
the second-order Taylor method converge to this exact solution? 

(b) Repeat part (a), but change the initial condition to a(l) = 1. The exact 
solution in this case is a(t) = 0”. 


13. 
14. 


15. 
16, 


17. 


18. 


19. 
(c) Explain any difference in rate of convergence between parts (a) and (b).
Repeat Exercise 12 using the fourth-order Taylor method. 
(a) Consider the initial value problem

$$\frac{dx}{dt} = -(1 + t + t^2) - (2t + 1)x - x^2 \quad (0 \le t \le 3), \qquad x(0) = -\frac{1}{2}.$$

The exact solution of this problem is $x(t) = -t - 1/(e^t + 1)$. With what rate does the second-order Taylor method converge to this exact solution?
(b) Repeat part (a), but change the initial condition to $x(0) = -1$. The exact solution in this case is $x(t) = -t - 1$.
(c) Explain any difference in rate of convergence between parts (a) and (b).
Repeat Exercise 14 using the fourth-order Taylor method. 


(a) Consider the initial value problem 
dz 2 Ax 
Kav U<t<6), afl) =2. 
= (Sts), x(1) 
The exact solution of this problem is z(t) = i? +41/t". With what rate does 
the fourth-order Taylor method converge to this exact solution? 
(b) Repeat part (a), but change the initial condition to x(1) = 1. The exact 
solution in this case is 2(t) = t°. 
(c) Explain any difference in rate of convergence between parts (a) and (b). 


Suppose a plate of AISI 304 stainless steel has been suspended in a furnace of temperature $T_F = 1500$ K. The plate has a thickness of $d = 0.002$ m, a mass density of $\rho = 7900$ kg/m$^3$ and an initial temperature of $T(0) = 300$ K. The specific heat data for AISI 304 is provided below. Determine how long it takes for the temperature of the plate to come into equilibrium with the furnace.
Temperature (K) 300 400 600 800 1000 1200 1500 
Specific Heat (J/kg-K) 477 515 557 582 611 640 682 


Suppose a plate of tungsten has been suspended in a furnace of temperature $T_F = 1500$ K. The plate has a thickness of $d = 0.002$ m, a mass density of $\rho = 19300$ kg/m$^3$ and an initial temperature of $T(0) = 300$ K. The specific heat data for tungsten is provided below. Determine how long it takes for the temperature of the plate to come into equilibrium with the furnace.


Temperature (K) 300 400 600 800 1000 1200 1500 
Specific Heat (J/kg-K) 132 137 142 145 148 152 157 


Recall the “Modeling the Spread of an Epidemic” problem of Section 7.2. Murray [3] reports the following parameter values for a flu epidemic that struck a boys’ boarding school in the early part of 1978: $N = 763$, $H_0 = 762$, $D(0) = 0$, $c = 0.00218$, and $m = 0.44036$. Simulate 15 days of the spread of this epidemic, and produce a plot like Figure 7.1. In this situation, $D$ denotes the number of boys confined to bed as a result of the flu.
7.4 RUNGE-KUTTA METHODS

As indicated in the previous section, the biggest disadvantage associated with Taylor methods of order $n > 1$ is the need to compute derivatives of the right-hand side function $f$. In this section, we will develop a class of higher-order one-step methods that use values of $f$ exclusively. These techniques, collectively, are called Runge-Kutta methods.

The fundamental idea behind the development of Runge-Kutta methods is to approximate, to appropriate order, the right-hand side of the Taylor method. In formulating this approximation, only evaluations of $f$ are to be used. Derivatives of $f$ are not to be included. For example, to develop a third-order Runge-Kutta method, we would attempt to determine an $O(h^3)$ approximation to the right-hand side of the third-order Taylor method.


Second-Order Runge-Kutta Methods 


We will develop second-order Runge-Kutta methods here to illustrate the process. Consider the explicit one-step method

$$\frac{w_{i+1} - w_i}{h} = \phi(f, t_i, w_i, h) \qquad (1)$$

with

$$\phi(f, t, y, h) = a_1 f(t,y) + a_2 f(t + \alpha_2,\ y + \delta_2 f(t,y)). \qquad (2)$$

Below, our objective will be to determine values for the parameters $a_1$, $a_2$, $\alpha_2$, and $\delta_2$ so that $\phi(f,t,y,h)$ provides an $O(h^2)$ approximation to the right-hand side of the $O(h^2)$ Taylor method.
In practice, the numerical method given by equations (1) and (2) would be implemented in two stages. In the first stage, we calculate

$$\tilde{w} = w_i + \delta_2 f(t_i, w_i),$$

which is just Euler’s method with $h = \delta_2$. Therefore, $\tilde{w} \approx y(t_i + \delta_2)$. In the second stage, we determine $w_{i+1}$ from the equation

$$\frac{w_{i+1} - w_i}{h} = a_1 f(t_i, w_i) + a_2 f(t_i + \alpha_2, \tilde{w}).$$

To avoid an unnecessary function evaluation, the value of $f(t_i, w_i)$ from the first stage should be saved and reused in the second stage.

Since $\tilde{w} \approx y(t_i + \delta_2)$, the last term in the equation for $w_{i+1}$ suggests that we should select $\alpha_2 = \delta_2$. Furthermore, since the linear combination of $f$ values in the $w_{i+1}$ equation has the appearance of a weighted average, we should also expect to find that $a_1 + a_2 = 1$. To confirm our intuition (and to determine what other relationships must hold between the parameters), let’s continue with our analysis.

To proceed with our development, we will need the following simplified version of Taylor’s theorem in two variables. For a general version of the theorem and a proof, see, for example, Douglass [A].
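Theorem. Let $f(t,y)$ and all of its first and second partial derivatives be continuous on $D = \{(t,y) \mid a \le t \le b,\ c \le y \le d\}$ and let $(t,y) \in D$. For any $\Delta t$ and $\Delta y$ such that $(t + \Delta t, y + \Delta y) \in D$, there exists $\xi$ between $t$ and $t + \Delta t$ and $\eta$ between $y$ and $y + \Delta y$ with

$$f(t + \Delta t, y + \Delta y) = f(t,y) + \left[\Delta t\,\frac{\partial f}{\partial t}(t,y) + \Delta y\,\frac{\partial f}{\partial y}(t,y)\right] + \left[\frac{\Delta t^2}{2}\frac{\partial^2 f}{\partial t^2}(\xi,\eta) + \Delta t\,\Delta y\,\frac{\partial^2 f}{\partial t\,\partial y}(\xi,\eta) + \frac{\Delta y^2}{2}\frac{\partial^2 f}{\partial y^2}(\xi,\eta)\right].$$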


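The right-hand side of the second-order Taylor method is

$$f(t,y) + \frac{h}{2}\frac{df}{dt}(t,y) = f(t,y) + \frac{h}{2}\left[\frac{\partial f}{\partial t}(t,y) + f(t,y)\frac{\partial f}{\partial y}(t,y)\right]$$

and, using Taylor’s theorem in two variables, the right-hand side for the second-order Runge-Kutta method can be written as

$$\phi(f, t, y, h) = a_1 f(t,y) + a_2 f(t + \alpha_2,\ y + \delta_2 f(t,y))$$

$$= a_1 f(t,y) + a_2\left[f(t,y) + \alpha_2\frac{\partial f}{\partial t}(t,y) + \delta_2 f(t,y)\frac{\partial f}{\partial y}(t,y)\right] + a_2 R_2$$

$$= (a_1 + a_2) f(t,y) + a_2\alpha_2\frac{\partial f}{\partial t}(t,y) + a_2\delta_2 f(t,y)\frac{\partial f}{\partial y}(t,y) + a_2 R_2,$$

where

$$R_2 = \frac{\alpha_2^2}{2}\frac{\partial^2 f}{\partial t^2}(\xi,\eta) + \alpha_2\delta_2 f(t,y)\frac{\partial^2 f}{\partial t\,\partial y}(\xi,\eta) + \frac{\delta_2^2}{2} f(t,y)^2\frac{\partial^2 f}{\partial y^2}(\xi,\eta).$$

Notice that every term in $R_2$ involves a second derivative of $f$. Equating the coefficients of like terms between the Taylor method and the Runge-Kutta method provides the equations

$$a_1 + a_2 = 1, \qquad a_2\alpha_2 = \frac{h}{2}, \qquad a_2\delta_2 = \frac{h}{2}.$$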
There are therefore an infinite number of second-order Runge-Kutta methods of 
the type considered. The most common second-order Runge-Kutta methods are as 
follows: 


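Modified Euler method ($a_1 = 0$, $a_2 = 1$, $\alpha_2 = \delta_2 = h/2$)

$$\tilde{w} = w_i + \frac{h}{2} f(t_i, w_i) \qquad \text{(Euler's method with step } h/2\text{)}$$
$$w_{i+1} = w_i + h f\left(t_i + \frac{h}{2}, \tilde{w}\right) \qquad \text{(midpoint integration)}$$

Heun method ($a_1 = a_2 = 1/2$, $\alpha_2 = \delta_2 = h$)

$$\tilde{w} = w_i + h f(t_i, w_i) \qquad \text{(Euler's method with step } h\text{)}$$
$$w_{i+1} = w_i + \frac{h}{2}\left[f(t_i, w_i) + f(t_i + h, \tilde{w})\right] \qquad \text{(trapezoidal integration)}$$

Optimal RK2 method ($a_1 = 1/4$, $a_2 = 3/4$, $\alpha_2 = \delta_2 = 2h/3$)

$$\tilde{w} = w_i + \frac{2h}{3} f(t_i, w_i)$$
$$w_{i+1} = w_i + \frac{h}{4} f(t_i, w_i) + \frac{3h}{4} f\left(t_i + \frac{2h}{3}, \tilde{w}\right)$$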


The last of these methods is optimal in the sense that the choice $a_2 = 3/4$
minimizes the numerical coefficient on the truncation error (see, for example,
Ralston [2]).
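The three methods listed above differ only in the weight $a_2$, so they can be sketched with a single routine. The following is a minimal illustration, not code from the text: the function name `rk2_step` is hypothetical, and it assumes $a_1 = 1 - a_2$ and $\alpha_2 = \delta_2 = h/(2a_2)$, which reproduces each of the three parameter sets given above.

```python
def rk2_step(f, t, w, h, a2):
    """One step of the two-stage Runge-Kutta family, parametrized by a2 (a2 != 0).

    Assumes a1 = 1 - a2 and alpha2 = delta2 = h / (2 * a2), which matches the
    modified Euler (a2 = 1), Heun (a2 = 1/2) and optimal RK2 (a2 = 3/4) methods.
    """
    a1 = 1.0 - a2
    alpha = h / (2.0 * a2)                 # alpha2 = delta2
    w_tilde = w + alpha * f(t, w)          # first stage: Euler step of size alpha
    return w + h * (a1 * f(t, w) + a2 * f(t + alpha, w_tilde))


def f(t, x):
    # Right-hand side of the test problem x' = 1 + x/t used in this section.
    return 1.0 + x / t


heun = rk2_step(f, 1.0, 1.0, 0.5, a2=0.5)
modified_euler = rk2_step(f, 1.0, 1.0, 0.5, a2=1.0)
optimal_rk2 = rk2_step(f, 1.0, 1.0, 0.5, a2=0.75)
print(heun, modified_euler, optimal_rk2)
```

Any other nonzero value of `a2` gives another member of the same second-order family.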


EXAMPLE 7.13 A Second-Order Runge-Kutta Method in Action 


All second-order Runge-Kutta methods operate in fundamentally the same two-stage
fashion, so here we will demonstrate only the Heun method. Consider the
initial value problem

$$\frac{dx}{dt} = 1 + \frac{x}{t} \quad (1 \le t \le 6), \qquad x(1) = 1,$$

and let the step size be $h = 0.5$.
For the first time step, we have $t_0 = w_0 = 1$. Using these values in the first
stage of the calculation, we find

$$f(t_0, w_0) = 1 + \frac{1}{1} = 2 \quad \text{and} \quad \tilde{w} = 1 + (0.5)(2) = 2.$$


Now, in the second stage, we obtain 


$$f(t_0 + h, \tilde{w}) = 1 + \frac{2}{1.5} = 2.333333$$

and

$$w_1 = 1 + \frac{0.5}{2}\left[2 + 2.333333\right] = 2.083333.$$

To perform the next time step, first set $t_1 = t_0 + h = 1.5$. The first stage of
the time step then yields

$$f(t_1, w_1) = 1 + \frac{2.083333}{1.5} = 2.388889 \quad \text{and} \quad \tilde{w} = 2.083333 + (0.5)(2.388889) = 3.277778.$$

In the second stage of the time step, we calculate

$$f(t_1 + h, \tilde{w}) = 1 + \frac{3.277778}{2} = 2.638889$$

and

$$w_2 = 2.083333 + \frac{0.5}{2}\left[2.388889 + 2.638889\right] = 3.340278.$$


Continuing in this fashion, we obtain the following results, with the value of the 
exact solution listed for comparison. 


t_i     w_i          x(t_i)       |x(t_i) - w_i|
1.0     1.000000     1.000000
1.5     2.083333     2.108198     0.024864
2.0     3.340278     3.386294     0.046017
2.5     4.725347     4.790727     0.065380
3.0     6.212083     6.295837     0.083754
3.5     7.783145     7.884670     0.101526
4.0     9.426273     9.545177     0.118905
4.5     11.132335    11.268348    0.136014
5.0     12.894261    13.047190    0.152929
5.5     14.706414    14.876115    0.169701
6.0     16.564194    16.750557    0.186363


We know that, by construction, the local truncation error associated with
a second-order Runge-Kutta method is $O(h^2)$. What about the global error? It
turns out that, like the second-order Taylor method, the global error associated with
second-order Runge-Kutta methods is also $O(h^2)$. The theoretical justification for
this statement will be provided in Section 7.6. Numerical investigations of the global
error for second-order Runge-Kutta methods will be carried out in the exercises.
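One way to probe the $O(h^2)$ claim numerically is to halve the step size and watch the error at the final time drop by roughly a factor of four. The sketch below does this for the Heun method on the test problem $x' = 1 + x/t$, $x(1) = 1$, whose exact solution is $x(t) = t(1 + \ln t)$. The helper name `heun_solve` and the choice of step counts are assumptions made for illustration.

```python
import math


def f(t, x):
    return 1.0 + x / t


def exact(t):
    return t * (1.0 + math.log(t))


def heun_solve(f, t0, w0, t_end, n):
    """Advance from t0 to t_end in n Heun steps and return the final value."""
    h = (t_end - t0) / n
    t, w = t0, w0
    for i in range(n):
        w_tilde = w + h * f(t, w)                        # Euler predictor
        w = w + 0.5 * h * (f(t, w) + f(t + h, w_tilde))  # trapezoidal update
        t = t0 + (i + 1) * h
    return w


err_h = abs(exact(6.0) - heun_solve(f, 1.0, 1.0, 6.0, 100))      # h = 0.05
err_half = abs(exact(6.0) - heun_solve(f, 1.0, 1.0, 6.0, 200))   # h = 0.025
ratio = err_h / err_half
print("error ratio when h is halved:", ratio)
```

A ratio near 4 is consistent with second-order global convergence; a ratio near 2 would indicate first order.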

There is one more issue we wish to address relating to second-order Runge- 
Kutta methods. How do these methods compare to one another in terms of per- 
formance? Further, how do these methods compare to the Taylor method from 
which they were derived? Figure 7.4 displays the logarithm of the absolute error 
in the approximate solutions obtained from the modified Euler method, the Heun 
method, the optimal RK2 method, and the second-order Taylor method. The top 
graph corresponds to the initial value problem 


$$\frac{dx}{dt} = 1 + \frac{x}{t} \quad (1 \le t \le 6), \qquad x(1) = 1,$$

and the bottom graph corresponds to the initial value problem

$$\frac{dx}{dt} = \frac{t}{x} \quad (0 \le t \le 5), \qquad x(0) = 1.$$


For both initial value problems and for all numerical methods, a step size of h = 0.05 
was used. 

Observe that for the first initial value problem, there is not much more than
half an order of magnitude difference between the most accurate solution (obtained
from the modified Euler method) and the least accurate solution (the Heun
method). The errors generated by the Taylor method and the optimal RK2 method
are indistinguishable to the resolution of the graph. For the second initial value
problem, the Taylor method now produces the least accurate solution, followed by
the modified Euler method, the optimal RK2 method, and the Heun method. The
errors generated by the first three methods are again within roughly half an order
of magnitude of one another; however, the errors generated by the Heun method
are more than an order of magnitude smaller than any of the other errors.

Figure 7.4 Comparison between second-order Runge-Kutta methods
and the second-order Taylor method. The top graph corresponds to the
initial value problem $x' = 1 + x/t$, $x(1) = 1$, while the bottom graph
corresponds to $x' = t/x$, $x(0) = 1$. A step size of $h = 0.05$ was used
for all methods. In the top graph, the errors generated by the Taylor
method and the optimal RK2 method are indistinguishable.

So what’s the moral of this story? Generally, any of the second-order methods 
we have considered will produce results that are roughly as accurate as any of 
the other second-order methods. For a particular problem, one technique may 
significantly outperform its counterparts, as was the case with the Heun method on 
the second initial value problem considered above, but there is no theory to indicate 
when this is going to happen. There is also no theory to indicate which second-order 
method will produce the most accurate approximation for a given problem. 


Classical Fourth-Order Runge-Kutta Method 


By far, the most common Runge-Kutta method is the classical fourth-order scheme.
(Like the second-order methods we derived earlier in this section, there is more than
one fourth-order Runge-Kutta method.) The classical fourth-order scheme updates
the approximate solution at each time step according to the formula

$$w_{i+1} = w_i + \frac{1}{6}\left(k_1 + 2k_2 + 2k_3 + k_4\right),$$

where

$$k_1 = h f(t_i, w_i)$$
$$k_2 = h f\left(t_i + \frac{h}{2}, w_i + \frac{k_1}{2}\right)$$
$$k_3 = h f\left(t_i + \frac{h}{2}, w_i + \frac{k_2}{2}\right)$$
$$k_4 = h f(t_i + h, w_i + k_3).$$


If $f$ is a function of $t$ only, this scheme reduces to Simpson's rule of integration. Note
that this method requires four function evaluations per time step. Experimental
verification of the fourth-order accuracy of this method is left as an exercise.
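The update formula translates almost line for line into code. The following is a minimal sketch of a single step; the function name `rk4_step` is an illustrative choice, not something taken from the text.

```python
def rk4_step(f, t, w, h):
    """One step of the classical fourth-order Runge-Kutta method."""
    k1 = h * f(t, w)
    k2 = h * f(t + h / 2, w + k1 / 2)
    k3 = h * f(t + h / 2, w + k2 / 2)
    k4 = h * f(t + h, w + k3)
    return w + (k1 + 2 * k2 + 2 * k3 + k4) / 6


# When f depends on t only, one step is Simpson's rule on [t, t + h]:
# here it integrates t**2 over [0, 1].
simpson_check = rk4_step(lambda t, x: t ** 2, 0.0, 0.0, 1.0)

# One step of size h = 1 on x' = 1 + x/t, x(1) = 1.
w1 = rk4_step(lambda t, x: 1.0 + x / t, 1.0, 1.0, 1.0)
print(simpson_check, w1)
```

Each call costs exactly four evaluations of `f`, which is the figure to keep in mind when comparing against lower-order methods.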


EXAMPLE 7.14 Fourth-Order Runge-Kutta Method in Action
Let's approximate the solution of the initial value problem

$$\frac{dx}{dt} = 1 + \frac{x}{t} \quad (1 \le t \le 6), \qquad x(1) = 1$$

using the classical fourth-order Runge-Kutta method with a step size of $h = 1$. For
the first time step, with $t_0 = w_0 = 1$, we calculate $k_1$, $k_2$, $k_3$, and $k_4$ as follows:


$$k_1 = h f(t_0, w_0) = f(1, 1) = 2$$
$$k_2 = h f(t_0 + h/2, w_0 + k_1/2) = f(1.5, 2) = 2.333333$$
$$k_3 = h f(t_0 + h/2, w_0 + k_2/2) = f(1.5, 2.166667) = 2.444444$$
$$k_4 = h f(t_0 + h, w_0 + k_3) = f(2, 3.444444) = 2.722222.$$

From here, we find that

$$w_1 = 1 + \frac{1}{6}\left[2 + 2(2.333333) + 2(2.444444) + 2.722222\right] = 3.379630.$$


Continuing in this fashion, we obtain the results listed below, with the value of the 
exact solution included for comparison. 


t_i     w_i          x(t_i)       |x(t_i) - w_i|
1.0     1.000000     1.000000
2.0     3.379630     3.386294     0.006665
3.0     6.285000     6.295837     0.010837
4.0     9.530510     9.545177     0.014667
5.0     13.028776    13.047190    0.018414
6.0     16.728424    16.750557    0.022133
Figure 7.5 Logarithm of error in approximate solution to the initial
value problem $x' = 1 + x/t$, $x(1) = 1$ computed using the classical
fourth-order Runge-Kutta method (RK4, with $h = 0.1$), the modified
Euler method (with $h = 0.05$), and Euler's method (with $h = 0.025$).


Let's perform a comparison between the various methods which have been
introduced thus far. In particular, let's apply Euler's method, the modified Euler
method, and the classical fourth-order Runge-Kutta method (which we subsequently
refer to as RK4) to the standard test problem

$$\frac{dx}{dt} = 1 + \frac{x}{t} \quad (1 \le t \le 6), \qquad x(1) = 1.$$

Step sizes of $h = 0.1$, $h = 0.05$, and $h = 0.025$ were used for the RK4 method,
the modified Euler method, and Euler's method, respectively, so as to allow each
method exactly four function evaluations for each advancement of the solution by
$\Delta t = 0.1$. The logarithm of the absolute error in each computed approximation is
shown in Figure 7.5. Even with the use of a smaller step size, the errors generated by
Euler's method are about two orders of magnitude larger than the errors generated
by the modified Euler method, which in turn are about two orders of magnitude
larger than the errors generated by the RK4 method. A comparison between the
fourth-order Taylor method and the RK4 method is left as an exercise.
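The equal-work comparison described above can be reproduced with a few lines of code. This is a sketch under stated assumptions: the helper names are hypothetical, errors are measured only at the final time $t = 6$, and the exact solution $x(t) = t(1 + \ln t)$ is used as the reference.

```python
import math


def f(t, x):
    return 1.0 + x / t


def euler_step(t, w, h):
    return w + h * f(t, w)


def modified_euler_step(t, w, h):
    w_tilde = w + 0.5 * h * f(t, w)
    return w + h * f(t + 0.5 * h, w_tilde)


def rk4_step(t, w, h):
    k1 = h * f(t, w)
    k2 = h * f(t + h / 2, w + k1 / 2)
    k3 = h * f(t + h / 2, w + k2 / 2)
    k4 = h * f(t + h, w + k3)
    return w + (k1 + 2 * k2 + 2 * k3 + k4) / 6


def solve(step, t0, w0, t_end, n):
    """Apply a one-step method n times on a uniform grid from t0 to t_end."""
    h = (t_end - t0) / n
    w = w0
    for i in range(n):
        w = step(t0 + i * h, w, h)
    return w


x_exact = 6.0 * (1.0 + math.log(6.0))
# Four function evaluations per advance of 0.1 in t for every method.
err_euler = abs(x_exact - solve(euler_step, 1.0, 1.0, 6.0, 200))        # h = 0.025
err_mod = abs(x_exact - solve(modified_euler_step, 1.0, 1.0, 6.0, 100))  # h = 0.05
err_rk4 = abs(x_exact - solve(rk4_step, 1.0, 1.0, 6.0, 50))              # h = 0.1
print(err_euler, err_mod, err_rk4)
```

Printing the three errors side by side shows the separation between the methods at equal cost.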


General Explicit Runge-Kutta Methods 


For completeness, it is worth noting that the Runge-Kutta methods we have discussed
are special instances of the general $s$-stage explicit Runge-Kutta method.
This general scheme can be expressed in the form

$$k_j = h f\left(t_i + c_j h,\; w_i + \sum_{l=1}^{j-1} a_{j,l} k_l\right), \qquad j = 1, 2, 3, \ldots, s$$

$$w_{i+1} = w_i + \sum_{j=1}^{s} b_j k_j,$$

where $c_1 = 0$. The coefficients $a_{j,l}$ are collectively referred to as the RK matrix.
The $b_j$ are called the RK weights, and the $c_j$ are called the RK nodes. The RK
matrix, weights, and nodes are often displayed graphically in the RK tableau:

$$\begin{array}{c|ccccc}
c_1 & & & & & \\
c_2 & a_{2,1} & & & & \\
c_3 & a_{3,1} & a_{3,2} & & & \\
\vdots & \vdots & \vdots & \ddots & & \\
c_s & a_{s,1} & a_{s,2} & \cdots & a_{s,s-1} & \\ \hline
 & b_1 & b_2 & \cdots & b_{s-1} & b_s
\end{array}$$
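The tableaus for the methods presented in this section are

$$\begin{array}{c|cc} 0 & & \\ \frac{1}{2} & \frac{1}{2} & \\ \hline & 0 & 1 \end{array}
\qquad
\begin{array}{c|cc} 0 & & \\ 1 & 1 & \\ \hline & \frac{1}{2} & \frac{1}{2} \end{array}
\qquad
\begin{array}{c|cc} 0 & & \\ \frac{2}{3} & \frac{2}{3} & \\ \hline & \frac{1}{4} & \frac{3}{4} \end{array}$$

(Modified Euler, Heun, and Optimal RK2, respectively) and

$$\begin{array}{c|cccc}
0 & & & & \\
\frac{1}{2} & \frac{1}{2} & & & \\
\frac{1}{2} & 0 & \frac{1}{2} & & \\
1 & 0 & 0 & 1 & \\ \hline
 & \frac{1}{6} & \frac{1}{3} & \frac{1}{3} & \frac{1}{6}
\end{array}$$

(Classical RK4).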
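Because the tableau fully specifies an explicit method, one routine can run any of the schemes in this section. The sketch below is illustrative only: the function name `explicit_rk_step` and the list-of-rows layout for the RK matrix are assumptions, and the stages follow the convention $k_j = h f(\cdot)$ used in this section.

```python
def explicit_rk_step(f, t, w, h, A, b, c):
    """One step of an s-stage explicit Runge-Kutta method given by its tableau.

    A is the strictly lower-triangular RK matrix stored as a list of rows
    (row j holds a_{j,1}, ..., a_{j,j-1}), b holds the weights, c the nodes.
    """
    k = []
    for j in range(len(b)):
        # Only previously computed stages enter, so the method is explicit.
        w_stage = w + sum(A[j][l] * k[l] for l in range(j))
        k.append(h * f(t + c[j] * h, w_stage))
    return w + sum(bj * kj for bj, kj in zip(b, k))


def f(t, x):
    return 1.0 + x / t


# Heun tableau.
heun = explicit_rk_step(f, 1.0, 1.0, 0.5, A=[[], [1.0]], b=[0.5, 0.5], c=[0.0, 1.0])

# Classical RK4 tableau.
rk4 = explicit_rk_step(
    f, 1.0, 1.0, 1.0,
    A=[[], [0.5], [0.0, 0.5], [0.0, 0.0, 1.0]],
    b=[1 / 6, 1 / 3, 1 / 3, 1 / 6],
    c=[0.0, 0.5, 0.5, 1.0],
)
print(heun, rk4)
```

The two calls reproduce the first steps of Examples 7.13 and 7.14, which is a quick sanity check on a tableau entered by hand.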


For a detailed development of general Runge-Kutta methods requiring a modest
theoretical background, consult Lambert [3]. More advanced treatments can be
found in Butcher [4] and Hairer, Nørsett, and Wanner [5].


An Application: A Model for a Genetic Switch 


A genetic switch is a biochemical mechanism that governs whether a particular
protein product of a cell (e.g., a pigment) is synthesized or not. The following
initial value problem has been proposed as a model for a genetic switch:

$$\frac{dg}{dt} = s - 1.51g + 3.03\frac{g^2}{1 + g^2}, \qquad g(0) = 0.$$

The variable $g$ denotes the concentration of the protein product, and the parameter
$s$ denotes the concentration of the chemical that activates the gene to produce
the protein. The second term on the right-hand side of the differential equation
is based on the assumption that, once produced, the protein naturally decays at a
rate proportional to its concentration. The final term in the differential equation
models a positive feedback effect that the protein exerts on its formation.

To be considered a reasonably accurate model for a genetic switch, this initial 
value problem must capture two important phenomena. The first is known as the 
threshold effect. This means that there must exist a critical, or threshold, value 
of the parameter s such that for values of s below the threshold, the equilibrium 
concentration of the gene remains near zero, but for values of s above the threshold, 
the equilibrium gene concentration jumps to a higher level. In other words, there 
must be a jump discontinuity in the equilibrium value of g as a function of the 
parameter. When in the higher equilibrium level, the gene is considered to be 
“on.” The second phenomenon that the initial value problem must capture is a 
hysteresis effect. Once the gene concentration has reached the “on” state, if the 
parameter value is set to zero, the gene concentration should approach some fixed, 
nonzero level; that is, the gene stays “on.” Here, we will examine the threshold effect. An
examination of the hysteresis effect will be left as an exercise.

To determine whether the given model exhibits the threshold effect, we need 
to establish that the equilibrium gene concentration jumps from a low value to 
a high value as the parameter s is varied. Figure 7.6 displays the approximate 
solution to the model initial value problem for four different values of s: s = 0.1,
s = 0.2, s = 0.3, and s = 0.4. The classical fourth-order Runge-Kutta method
was used with a step size of h = 0.2. There is a definite jump in equilibrium gene 
concentration as the parameter is changed from s = 0.2 to s = 0.3. To be certain 
that the equilibrium concentration doesn’t vary continuously from the lower level 
to the higher level, we examine other parameter values between 0.2 and 0.3. The 
approximate solutions for s = 0.2, s = 0.202, s = 0.204, and s = 0.206 are shown 
in Figure 7.7. Again, RK4 with a step size of 0.2 was used. There is no doubt that 
this initial value problem exhibits the threshold effect. Furthermore, the critical 
value for s is somewhere between 0.202 and 0.204. 
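The threshold experiment can be sketched in a few lines. This is an illustration under stated assumptions: the right-hand side is taken to be $s - 1.51g + 3.03g^2/(1 + g^2)$, the step size is the $h = 0.2$ quoted above, and the final time of 400 is an arbitrary choice (not from the text) made long enough for the solution to settle. The function names are hypothetical.

```python
def switch_rhs(g, s):
    # Assumed form of the model: activation, linear decay, positive feedback.
    return s - 1.51 * g + 3.03 * g * g / (1.0 + g * g)


def settled_concentration(s, h=0.2, t_end=400.0):
    """Integrate g' = switch_rhs(g, s), g(0) = 0, with RK4 and return g(t_end)."""
    g = 0.0
    for _ in range(int(round(t_end / h))):
        k1 = h * switch_rhs(g, s)
        k2 = h * switch_rhs(g + k1 / 2, s)
        k3 = h * switch_rhs(g + k2 / 2, s)
        k4 = h * switch_rhs(g + k3, s)
        g += (k1 + 2 * k2 + 2 * k3 + k4) / 6
    return g


for s in (0.2, 0.202, 0.204, 0.206):
    print(s, settled_concentration(s))
```

Near the critical value of s the solution lingers for a long time before rising, so a short integration window can make an "on" case look as if it is still "off".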
Figure 7.6 Approximate solutions to genetic switch model initial value
problem for different values of the control parameter s. Note the jump
in the equilibrium value of the gene concentration as s changes from 0.2
to 0.3.

Figure 7.7 Approximate solutions to genetic switch model initial value
problem for different values of the control parameter s. Note the jump in
the equilibrium value of the gene concentration as s changes from 0.202
to 0.204.
EXERCISES


Chapter 7 Initial Value Problems of Ordinary Differential Equations 


For the differential equations in Exercises 1-3, write out the equation for $w_{i+1}$ explicitly
in terms of $h$, $t_i$, and $w_i$ for the modified Euler method, the Heun method, and the
optimal RK2 method. Provide an explanation for the peculiar phenomena you observe
in the three equations for $w_{i+1}$.


1. 


- 


13. 


i 
zx=t-2 


22 = Ra — 2t 
3, 
4, Apply the modified Euler method to approximate the solution of the given initial 


v= 3t+2~4e 


value problem over the indicated interval in t using the indicated number of time 
steps. 


(a) ce =tx?-2 (0<t<1), 2(0)=1, N=4 

(b) 2’ +(4/t)e=t* (1St<3), 2(l)=1, N=5 

(c) 2’ =(sinz—-e')/cosr (0<t<1), 2(0)=0, N=3 
(d) c' =(14+a*)/t (1<t<4), 2()=0, NaS 

(e) 2 =t?-2?-1 (O<t<1), 2(0)=0, N=4 
(f) 2 =21-a2)/(sinz) (1<ti<2), a(1)=2, N=3 


5. Repeat Exercise 4 using the Heun method.
6. Repeat Exercise 4 using the optimal RK2 method.
7. Repeat Exercise 4 using the classical fourth-order Runge-Kutta method, but take
N = 2 in each case.


8. Use each of the following initial value problems to demonstrate that the global
error associated with the modified Euler method is $O(h^2)$.


(a) ae =4tVe? + 1/2 (0<t<5), 2(0)=1, a(t) = J(2024+V2)2-1 
(b) te’ 4a =the’ (1<t<2), x(l)=0, x(t) =t*(eb—e) 

(c) & =1—-r+e%x? (0<t<09), x(0)=0, x(t) =e7*tan(e’~1) 
(d) 2 =2t(e+l/e (<t<2), 2(0)=-2 

(e) a! = (#2 -te+27)/(tz) (1<t<2), x(1)=2 


9. Repeat Exercise 8 for the Heun method.
10. Repeat Exercise 8 for the optimal RK2 method.
11. Select an arbitrary value for $a_2$ between 0 and 1 (but not 1/2, 3/4, or 1), and
implement the resulting second-order Runge-Kutta scheme. Numerically verify that
the global error is $O(h^2)$ using the initial value problems from Exercise 8.
12. Numerically demonstrate that the global error of the classical fourth-order
Runge-Kutta scheme is $O(h^4)$ using the initial value problems from Exercise 8.
Nyström's method is a three-stage Runge-Kutta method whose tableau is


lesics wn Oo 


— 


bi CO cleo 
Co|09] colto 


on}c3| 


14. 


15. 


16. 


17. 


18. 


19. 
20, 
Implement this scheme and numerically determine the rate of convergence of the
global error using the initial value problems from Exercise 8.


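Kutta's method is a four-stage Runge-Kutta method whose tableau is

$$\begin{array}{c|cccc}
0 & & & & \\
\frac{1}{3} & \frac{1}{3} & & & \\
\frac{2}{3} & -\frac{1}{3} & 1 & & \\
1 & 1 & -1 & 1 & \\ \hline
 & \frac{1}{8} & \frac{3}{8} & \frac{3}{8} & \frac{1}{8}
\end{array}$$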
Implement this scheme and numerically determine the rate of convergence of the 
global error using the initial value problems from Exercise 8. 


Compare the performance of Euler’s method, one of the second-order Runge- 
Kutta methods, and the classical fourth-order Runge-Kutta method on the initial 
value problem 


$$\frac{dx}{dt} = \frac{t}{x} \quad (0 \le t \le 5), \qquad x(0) = 1.$$

The exact solution for this problem is $x(t) = \sqrt{t^2 + 1}$.
Repeat Exercise 15 using the initial value problem 


$$\frac{dx}{dt} = \frac{tx^2 - x}{t} \quad (1 \le t \le 5), \qquad x(1) = -\frac{1}{\ln 2},$$

whose exact solution is $x(t) = -1/(t \ln(2t))$.


Compare the performance of the fourth-order Taylor method and the classical 
fourth-order Runge-Kutta method on each of the following initial value prob- 
lems. 


(a) $x' = 1 + x/t$ $(1 \le t \le 6)$, $x(1) = 1$, $x(t) = t(1 + \ln t)$
(b) $x' = t/x$ $(0 \le t \le 5)$, $x(0) = 1$, $x(t) = \sqrt{t^2 + 1}$
(c) $x' = (tx^2 - x)/t$ $(1 \le t \le 5)$, $x(1) = -1/\ln 2$, $x(t) = -1/(t \ln(2t))$


(a) Consider the initial value problem 


The exact solution of this problem is x(f) = t? + 1/t. With what rate does 
the modified Euler method converge to this exact solution? 

(b) Repeat part (a), but change the initial condition to x(1) = 1. The exact 
solution in this case is x(t) = ¢?. 

(c) Explain any difference in rate of convergence between parts (a) and (b). 

Repeat Exercise 18 using the classical fourth-order Runge-Kutta method. 

(a) Consider the initial value problem 


B a-(1+t+0)- Qt )e-2 (0<¢t<3), 2(0)=--. 
22.


23. 


24. 


26. 
The exact solution of this problem is $x(t) = -t - 1/(e^t + 1)$. With what
rate does the optimal RK2 method converge to this exact solution?

(b) Repeat part (a), but change the initial condition to $x(0) = -1$. The exact
solution in this case is $x(t) = -t - 1$.

(c) Explain any difference in rate of convergence between parts (a) and (b). 

Repeat Exercise 20 using the classical fourth-order Runge-Kutta method. 

(a) Consider the initial value problem 


The exact solution of this problem is x(t) = #° + 1/t*. With what rate 
does the classical fourth-order Runge-Kutta method converge to this exact 
solution? 

(b) Repeat part (a), but change the initial condition to z(1) = 1. The exact 
solution in this case is z(t) = t°. 

(c) Explain any difference in rate of convergence between parts (a) and (b). 


Reexamine the genetic switch problem 


dg _ gy 


paying attention to the hysteresis effect. Use initial conditions g(0) = 1.8, 9(0) = 
1.68, and g{0) = 1.57 (roughly the equilibrium values attained in Figures 7.6 
and 7.7 and set s = 0. What new equilibrium level is reached? Does the gene 
stay “on” even after the parameter s is set to zero? 

The following initial value problem has been proposed as a model for a genetic 
switch: 


dg _ 9 _ 
eee 2.435 ri 1.619, (0) =0. 


Does this initial value problem exhibit the threshold effect? 


. Recall the “Radiative Heat Transfer to a Thin Metal Plate” problem from Sec- 


tion 7.3. Suppose a plate of chromium has been suspended in a furnace of 
temperature Tp = 1500 K. The plate has a thickness of d = 0.002 m, a mass 
density of p = 7160 kg/m® and an initial temperature of T(0) = 300 K. The 
specific heat data for chromium is provided below. Determine how long it takes 
for the temperature of the plate to come into equilibrium with the furnace. 


Temperature (K) 300 400 600 800 1000 1200 1500 
Specific Heat (J/kg -K}) 449 484 542 581 616 682 779 


Repeat Exercise 25 for a silver plate of thickness d = 0.002 m, mass density 

p = 10500 kg/m? and initial temperature T(0) = 300 K suspended in a furnace 

of temperature Tr = 1200 K. The specific heat data for silver is provided below. 
Temperature (K) 300 400 600 800 1000 1200 
Specific Heat (J/kg K) 235 239 250 262 277 292 
7.5 MULTISTEP METHODS
Recall from the introduction to this chapter that the general form for a linear
m-step multistep method is

$$\frac{w_{i+1} - a_1 w_i - a_2 w_{i-1} - \cdots - a_m w_{i+1-m}}{h} = b_0 f(t_{i+1}, w_{i+1}) + b_1 f(t_i, w_i) + b_2 f(t_{i-1}, w_{i-1}) + \cdots + b_m f(t_{i+1-m}, w_{i+1-m}).$$

When $b_0 = 0$, the method is said to be explicit; otherwise, it is said to be implicit.
We will discuss the advantages and disadvantages of explicit versus implicit methods
later in this section.

Note that there is one immediate obstacle to the practical implementation of
a multistep method: obtaining starting values for the difference equation. Unlike
a one-step method, for which knowledge of $w_0$ is sufficient to begin calculations,
multistep methods require multiple starting values. For example, since a four-step
method makes use of $w_i$, $w_{i-1}$, $w_{i-2}$, and $w_{i-3}$, the starting values $w_0$, $w_1$,
$w_2$, and $w_3$ would be needed to begin calculations. Any starting values needed
beyond $w_0$ are best determined using a one-step method of appropriate order. By
this we mean that for a fourth-order multistep method, starting values should be
determined using a fourth-order one-step method such as the classical fourth-order
Runge-Kutta method.

Once the starting values have been obtained, however, multistep methods 
offer the advantage of requiring just one new function evaluation per time step, 
regardless of the order of approximation of the method. Compare this with the one- 
step methods we studied in Sections 7.2, 7.3, and 7.4: The first-order method used 
one function evaluation, the second-order methods used two function evaluations, 
and the fourth-order methods used four new function evaluations per time step. 

In this section, we will focus on the derivation of two special classes of multistep
methods. First we will derive the so-called Adams-Bashforth methods, which
are explicit methods. Next, the derivation of the implicit Adams-Moulton methods
will be considered. Predictor-corrector methods, which are a special combination
of explicit and implicit methods, will also be discussed.


Adams-Bashforth Methods 


The general procedure for deriving an m-step Adams-Bashforth method is rather
straightforward. Throughout, we will assume a uniform discretization in the
t-domain; that is, we define $t_i = a + ih$, where $h = (b - a)/N$ for some positive
integer $N$. We begin by integrating both sides of the model differential equation

$$y'(t) = f(t, y(t))$$

from $t = t_i$ to $t = t_{i+1}$. This yields the equation

$$y(t_{i+1}) - y(t_i) = \int_{t_i}^{t_{i+1}} f(t, y(t))\,dt.$$

Next, we write

$$f(t, y(t)) = P_{m-1}(t) + R_{m-1}(t),$$

where

$$P_{m-1}(t) = \sum_{j=1}^{m} L_{m-1,j}(t)\, f(t_{i+1-j}, y(t_{i+1-j}))$$

is the Lagrange form of the polynomial of degree at most $m - 1$ that interpolates
$f$ at the $m$ points $t_i, t_{i-1}, t_{i-2}, \ldots, t_{i+1-m}$ and

$$R_{m-1}(t) = \frac{f^{(m)}(\xi, y(\xi))}{m!} \prod_{j=1}^{m} (t - t_{i+1-j})$$

is the corresponding remainder term. As with the interpolation-based differentiation
and integration formulas we derived in Chapter 6, the Lagrange form of the
interpolating polynomial is chosen here because of its clear and explicit dependence
on the function values at the interpolating points. The integrals of the Lagrange
polynomials, $L_{m-1,j}(t)$, provide the values of the coefficients, $b_j$, on the right-hand
side of the difference equation, while the integral of the remainder term provides
the truncation error term. Upon dropping the truncation error term and replacing
all y's with w's, derivation of the method is complete.
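The coefficient calculation described above can be carried out exactly with rational arithmetic. The sketch below is an illustration, not the text's code: it uses the scaled variable $s$ defined by $t = t_i + sh$, so the interpolation points sit at $s = 0, -1, \ldots, -(m-1)$ and each coefficient (with the factor $h$ divided out) is the integral of a Lagrange basis polynomial over $0 \le s \le 1$. All function names are hypothetical.

```python
from fractions import Fraction


def poly_mul(p, q):
    """Multiply two polynomials given as coefficient lists in ascending powers."""
    r = [Fraction(0)] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            r[i + j] += a * b
    return r


def integrate_0_to_1(p):
    """Integrate a polynomial (ascending coefficients) over [0, 1]."""
    return sum(c / (k + 1) for k, c in enumerate(p))


def adams_bashforth_coefficients(m):
    """Return b_1, ..., b_m for the m-step Adams-Bashforth method."""
    nodes = [Fraction(-j) for j in range(m)]      # s = 0, -1, ..., -(m - 1)
    coeffs = []
    for j, sj in enumerate(nodes):
        basis = [Fraction(1)]
        for l, sl in enumerate(nodes):
            if l != j:
                # Multiply by the factor (s - sl) / (sj - sl).
                basis = poly_mul(basis, [-sl / (sj - sl), 1 / (sj - sl)])
        coeffs.append(integrate_0_to_1(basis))
    return coeffs


print(adams_bashforth_coefficients(2))
print(adams_bashforth_coefficients(4))
```

Using `Fraction` avoids rounding, so the output can be compared directly with the fractions that appear in the difference equations.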

Let's demonstrate this procedure by deriving the 2-step Adams-Bashforth
method. Since $m = 2$, we write

$$f(t, y(t)) = \frac{t - t_{i-1}}{t_i - t_{i-1}} f(t_i, y_i) + \frac{t - t_i}{t_{i-1} - t_i} f(t_{i-1}, y_{i-1}) + \frac{f''(\xi, y(\xi))}{2}(t - t_i)(t - t_{i-1}).$$


Integrating the Lagrange polynomials yields

$$b_1 = \int_{t_i}^{t_{i+1}} \frac{t - t_{i-1}}{t_i - t_{i-1}}\,dt = h\int_0^1 (s + 1)\,ds = \frac{3h}{2}$$

and

$$b_2 = \int_{t_i}^{t_{i+1}} \frac{t - t_i}{t_{i-1} - t_i}\,dt = -h\int_0^1 s\,ds = -\frac{h}{2}.$$

The change of variable $t = t_i + sh$ was introduced to simplify the calculations. Using
this same change of variable and applying the Weighted Mean-Value Theorem
for Integrals to the integral of the remainder term produces

$$\int_{t_i}^{t_{i+1}} \frac{f''(\xi, y(\xi))}{2}(t - t_i)(t - t_{i-1})\,dt = \frac{h^3 f''(\hat{\xi}, y(\hat{\xi}))}{2}\int_0^1 s(s + 1)\,ds = \frac{5h^3}{12} y'''(\hat{\xi}).$$

In the final step, the differential equation was used to replace $f''$ by $y'''$. Gathering
everything together and rearranging into standard form, we have

$$\frac{y_{i+1} - y_i}{h} = \frac{3}{2} f(t_i, y_i) - \frac{1}{2} f(t_{i-1}, y_{i-1}) + \frac{5h^2}{12} y'''(\hat{\xi}).$$
Therefore, the two-step Adams-Bashforth method is

$$\frac{w_{i+1} - w_i}{h} = \frac{3}{2} f(t_i, w_i) - \frac{1}{2} f(t_{i-1}, w_{i-1})$$

with local truncation error

$$\tau_i = \frac{5h^2}{12} y'''(\xi) = O(h^2).$$

Note that the two-step method is second-order. To implement this scheme, values
are needed for both $w_0$ and $w_1$. The initial condition is, of course, assigned to $w_0$,
while $w_1$ can be obtained using any second-order one-step method.

Proceeding in a similar manner, the three-step Adams-Bashforth method

$$\frac{w_{i+1} - w_i}{h} = \frac{23}{12} f(t_i, w_i) - \frac{4}{3} f(t_{i-1}, w_{i-1}) + \frac{5}{12} f(t_{i-2}, w_{i-2}), \qquad \tau_i = \frac{3h^3}{8} y^{(4)}(\xi) = O(h^3)$$

and the four-step Adams-Bashforth method

$$\frac{w_{i+1} - w_i}{h} = \frac{55}{24} f(t_i, w_i) - \frac{59}{24} f(t_{i-1}, w_{i-1}) + \frac{37}{24} f(t_{i-2}, w_{i-2}) - \frac{9}{24} f(t_{i-3}, w_{i-3}), \qquad \tau_i = \frac{251h^4}{720} y^{(5)}(\xi) = O(h^4)$$

can be derived. The starting values, $w_1$ and $w_2$, for the three-step method should
be obtained from a third-order one-step method; the values $w_1$, $w_2$, and $w_3$ for
the four-step method should be obtained from a fourth-order one-step method. In
general, the m-step Adams-Bashforth method has local truncation error that is
$O(h^m)$, and therefore the method is of order $m$. Numerical verification that the
global error of the m-step Adams-Bashforth method is also $O(h^m)$ will be considered
in the exercises.
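The two-step scheme above can be sketched as follows. This is a minimal illustration with a hypothetical function name; the starting value $w_1$ is supplied by the caller, since any second-order one-step method may be used to produce it. The routine keeps the previous function value so that each step costs only one new evaluation of $f$.

```python
def adams_bashforth2(f, t0, w0, w1, h, n):
    """Return [w_0, ..., w_n] from the two-step Adams-Bashforth method.

    w1 must be supplied by a one-step method of at least second order.
    """
    ws = [w0, w1]
    f_prev = f(t0, w0)                    # f(t_{i-1}, w_{i-1})
    for i in range(1, n):
        f_curr = f(t0 + i * h, ws[i])     # the only new evaluation this step
        ws.append(ws[i] + 0.5 * h * (3.0 * f_curr - f_prev))
        f_prev = f_curr                   # reuse on the next step
    return ws


def f(t, x):
    return 1.0 + x / t


# x' = 1 + x/t, x(1) = 1 with h = 0.5 and w_1 = 2.125 (the value produced by
# the second-order Taylor method in Example 7.15).
ws = adams_bashforth2(f, 1.0, 1.0, 2.125, 0.5, 10)
print(ws[2], ws[3], ws[4])
```

The same save-and-reuse pattern extends to the three- and four-step methods, which need two and three stored function values respectively.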


EXAMPLE 7.15 The Two-Step Adams-Bashforth Method in Action 
Consider our standard test problem

$$\frac{dx}{dt} = 1 + \frac{x}{t} \quad (1 \le t \le 6), \qquad x(1) = 1.$$

Before we can begin calculating with the two-step Adams-Bashforth method, we
need a value for $w_1$. We can use any second-order one-step method to do this,
but here we will use the second-order Taylor method. In practice, we would not
choose this one-step method since it requires the derivative of the right-hand side
function, whereas the Adams-Bashforth method does not. We have chosen to use
the Taylor method in this example simply to facilitate comparisons between the
Adams-Bashforth and the Taylor methods.

For the first time step, we have $t_0 = w_0 = 1$, from which it follows that
$f(t_0, w_0) = 2$. The derivative of $f$ is

$$f'(t, x) = \frac{t x' - x}{t^2} = \frac{1}{t} \quad \Longrightarrow \quad f'(t_0, w_0) = 1.$$

Using a step size of $h = 0.5$, the second-order Taylor method produces

$$w_1 = 1 + (0.5)(2) + \frac{(0.5)^2}{2}(1) = 2.125.$$


For all remaining time steps we use the difference equation for the two-step
Adams-Bashforth method. In order to proceed as efficiently as possible, we save the value
of $f(t_0, w_0)$ and calculate

$$f(t_1, w_1) = 1 + \frac{2.125}{1.5} = 2.416667.$$

It then follows that

$$w_2 = w_1 + \frac{h}{2}\left[3f(t_1, w_1) - f(t_0, w_0)\right] = 2.125 + 0.25\left[3(2.416667) - 2\right] = 3.4375.$$

At this point, the value of $f(t_0, w_0)$ is no longer needed, but we will need $f(t_1, w_1)$
for the next time step. We therefore save the value of $f(t_1, w_1)$ in place of $f(t_0, w_0)$
and then calculate $f(t_2, w_2) = 2.71875$ and

$$w_3 = 3.4375 + 0.25\left[3(2.71875) - 2.416667\right] = 4.872396.$$


Continuing in this fashion, we obtain the following results. 


t_i     Adams-Bashforth   Error       Taylor       Error       Ratio of Errors
1.0     1.000000          0.000000    1.000000     0.000000
1.5     2.125000          0.016802    2.125000     0.016802
2.0     3.437500          0.051206    3.416667     0.030372    1.6859
2.5     4.872396          0.081669    4.833333     0.042607    1.9168
3.0     6.404427          0.108590    6.350000     0.054163    2.0049
3.5     8.018294          0.133624    7.950000     0.065330    2.0454
4.0     9.702798          0.157620    9.621429     0.076251    2.0671
4.5     11.449337         0.180989    11.355357    0.087009    2.0801
5.0     13.251135         0.203946    13.144841    0.097652    2.0885
5.5     15.102731         0.226617    14.984325    0.108211    2.0942
6.0     16.999638         0.249081    16.869264    0.118707    2.0983
Observe that the error in the Adams-Bashforth approximation for each $t_i \ge 2$
is roughly twice as large as the error in the corresponding Taylor approximation.
If we examine the truncation error terms of the two methods, this result is not
very surprising. The coefficient on the truncation error for the two-step
Adams-Bashforth method is 5/12, which is 2.5 times the truncation error coefficient for the
second-order Taylor method. The ratio of truncation error coefficients grows even
larger with increasing order. For example, the truncation error coefficient for the
three-step Adams-Bashforth method is nine times larger than the coefficient for the
third-order Taylor method. For fourth-order methods, the ratio of truncation error
coefficients is 251/6 > 40. Clearly, to achieve similar accuracy, Adams-Bashforth
methods will have to use smaller step sizes than Taylor methods.

The astute reader is commenting at this point that the comparisons we've
just made are not quite fair and that using smaller step sizes for Adams-Bashforth
methods is not really a problem. After all, a Taylor method uses more function
evaluations per time step than the Adams-Bashforth method of similar order. To
level the playing field, we have to allow the Adams-Bashforth method to use a
smaller step size and take more time steps. In particular, the two-step
Adams-Bashforth method should be allowed a step size that is one-half the step size used
by the second-order Taylor method. For the same number of function evaluations,
we might therefore expect the errors from the Adams-Bashforth method to be
roughly 2.5/4 = 5/8 as large as the errors from the Taylor method. Figure 7.8 shows
this to be the case for our two standard initial value problems. In each graph,
the vertical shift between the errors is roughly $0.2 \approx |\log_{10}(5/8)|$. When we
allow the same number of function evaluations for fourth-order methods, we might
expect the errors from the Adams-Bashforth method to be roughly (251/6)/256 <
1/6 as large as the errors from the Taylor method. We will examine this in the
exercises.
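The equal-work estimate of 5/8 can be checked numerically on the standard test problem. The sketch below is illustrative and rests on a few assumptions not taken from the text: errors are compared only at $t = 6$, the Adams-Bashforth starting value is computed with one Heun step, and the helper names are hypothetical. It uses $f'(t, x) = 1/t$, which holds for this particular right-hand side.

```python
import math


def f(t, x):
    return 1.0 + x / t


def exact(t):
    return t * (1.0 + math.log(t))


def taylor2(t0, w0, t_end, n):
    """Second-order Taylor method for x' = 1 + x/t, using f'(t, x) = 1/t."""
    h = (t_end - t0) / n
    w = w0
    for i in range(n):
        t = t0 + i * h
        w = w + h * f(t, w) + 0.5 * h * h * (1.0 / t)
    return w


def ab2(t0, w0, t_end, n):
    """Two-step Adams-Bashforth method, started with a single Heun step."""
    h = (t_end - t0) / n
    f_prev = f(t0, w0)
    w = w0 + 0.5 * h * (f_prev + f(t0 + h, w0 + h * f_prev))
    for i in range(1, n):
        f_curr = f(t0 + i * h, w)
        w, f_prev = w + 0.5 * h * (3.0 * f_curr - f_prev), f_curr
    return w


err_taylor = abs(exact(6.0) - taylor2(1.0, 1.0, 6.0, 100))   # h = 0.05
err_ab2 = abs(exact(6.0) - ab2(1.0, 1.0, 6.0, 200))          # h = 0.025
ratio = err_ab2 / err_taylor
print("error ratio (AB2 at h/2 over Taylor at h):", ratio)
```

The comparison counts the Taylor method's evaluation of $f'$ as a second function evaluation per step, which is why the Adams-Bashforth method is given twice as many steps.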

So, Adams-Bashforth methods appear to outperform Taylor methods when we
fix the number of function evaluations. What about Runge-Kutta methods? In
Section 7.4, we found that on some problems a Runge-Kutta method will outperform
the Taylor method of the same order, while on other problems the Runge-Kutta
method will underperform the Taylor method. It is likely that a similar statement
can be made regarding Runge-Kutta methods and Adams-Bashforth methods.
In particular, Figure 7.9 shows that the two-step Adams-Bashforth method
produces more accurate results than the Heun method and the optimal RK2 method
but is slightly less accurate than the modified Euler method for the initial value
problem $x' = 1 + x/t$, $x(1) = 1$. On the other hand, for the initial value problem
$x' = t/x$, $x(0) = 1$, the Adams-Bashforth method underperforms all three
Runge-Kutta methods. We will compare the fourth-order methods in the exercises.


Adams-Moulton Methods 


The derivation of Adams-Moulton methods follows exactly the same procedure as
the derivation of the Adams-Bashforth methods just presented, with one exception.
In addition to interpolating $f$ at $t_i$, $t_{i-1}$, $t_{i-2}$, ..., and $t_{i+1-m}$, we also interpolate
at $t_{i+1}$. Hence, we write

Figure 7.8 Comparison between the two-step Adams-Bashforth method
(with $h = 0.025$ and $w_1$ calculated using the optimal RK2 method)
and the second-order Taylor method (with $h = 0.05$). The top graph
corresponds to the initial value problem $x' = 1 + x/t$, $x(1) = 1$, while
the bottom graph corresponds to $x' = t/x$, $x(0) = 1$.


$$f(t, y(t)) = P_m(t) + R_m(t),$$

where

$$P_m(t) = \sum_{j=0}^{m} L_{m,j}(t)\, f(t_{i+1-j}, y(t_{i+1-j}))$$

and

$$R_m(t) = \frac{f^{(m+1)}(\xi, y(\xi))}{(m + 1)!} \prod_{j=0}^{m} (t - t_{i+1-j}).$$


Since $P_m(t)$ contains a term involving $f(t_{i+1}, y(t_{i+1}))$, the resulting method will be
implicit. Furthermore, by using an additional point in the interpolating polynomial,
the degree of the remainder term is increased by one over the Adams-Bashforth
case, which implies that we get one more power of $h$ in the truncation error term.
Therefore, in general, an m-step Adams-Moulton method has local truncation error
which is $O(h^{m+1})$, so the method is of order $m + 1$.
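As a cross-check on the Adams-Moulton coefficients, they can also be obtained by undetermined coefficients rather than by integrating Lagrange polynomials. This is a different route from the derivation in the text: with $t = t_i + sh$ and nodes at $s = 1, 0, -1, \ldots, 1 - m$, the weights are required to integrate $1, s, \ldots, s^m$ exactly over $0 \le s \le 1$. The function name is hypothetical, and the resulting weights have the factor $h$ divided out.

```python
from fractions import Fraction


def adams_moulton_coefficients(m):
    """Return b_0, ..., b_m for the m-step Adams-Moulton method."""
    nodes = [Fraction(1 - j) for j in range(m + 1)]   # s = 1, 0, -1, ..., 1 - m
    n = len(nodes)
    # Row k enforces exactness for s**k: sum_j b_j * s_j**k = 1 / (k + 1).
    rows = [[s ** k for s in nodes] + [Fraction(1, k + 1)] for k in range(n)]
    # Gauss-Jordan elimination in exact rational arithmetic.
    for col in range(n):
        pivot = next(r for r in range(col, n) if rows[r][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        rows[col] = [x / rows[col][col] for x in rows[col]]
        for r in range(n):
            if r != col:
                factor = rows[r][col]
                rows[r] = [x - factor * y for x, y in zip(rows[r], rows[col])]
    return [rows[j][n] for j in range(n)]


print(adams_moulton_coefficients(2))
print(adams_moulton_coefficients(3))
```

Both approaches must agree, because the interpolating polynomial through $m + 1$ points reproduces every polynomial of degree at most $m$.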
Figure 7.9 Comparison between the two-step Adams-Bashforth method
(with $h = 0.025$ and $w_1$ calculated using the optimal RK2 method)
and the second-order Runge-Kutta methods (with $h = 0.05$). The top
graph corresponds to the initial value problem $x' = 1 + x/t$, $x(1) = 1$,
while the bottom graph corresponds to $x' = t/x$, $x(0) = 1$.


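Let us demonstrate this procedure by deriving the 2-step Adams-Moulton
method. Since $m = 2$, we write

$$f(t, y(t)) = \frac{(t - t_i)(t - t_{i-1})}{(t_{i+1} - t_i)(t_{i+1} - t_{i-1})} f(t_{i+1}, y_{i+1}) + \frac{(t - t_{i+1})(t - t_{i-1})}{(t_i - t_{i+1})(t_i - t_{i-1})} f(t_i, y_i)$$
$$\qquad + \frac{(t - t_{i+1})(t - t_i)}{(t_{i-1} - t_{i+1})(t_{i-1} - t_i)} f(t_{i-1}, y_{i-1}) + \frac{f'''(\xi, y(\xi))}{3!}(t - t_{i+1})(t - t_i)(t - t_{i-1}).$$

Integrating the Lagrange polynomials yields

$$b_0 = \int_{t_i}^{t_{i+1}} \frac{(t - t_i)(t - t_{i-1})}{(t_{i+1} - t_i)(t_{i+1} - t_{i-1})}\,dt = \frac{h}{2}\int_0^1 s(s + 1)\,ds = \frac{5h}{12},$$

$$b_1 = \int_{t_i}^{t_{i+1}} \frac{(t - t_{i+1})(t - t_{i-1})}{(t_i - t_{i+1})(t_i - t_{i-1})}\,dt = -h\int_0^1 (s - 1)(s + 1)\,ds = \frac{2h}{3},$$

and

$$b_2 = \int_{t_i}^{t_{i+1}} \frac{(t - t_{i+1})(t - t_i)}{(t_{i-1} - t_{i+1})(t_{i-1} - t_i)}\,dt = \frac{h}{2}\int_0^1 s(s - 1)\,ds = -\frac{h}{12}.$$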

Once again, the change of variable $t = t_i + sh$ was introduced to simplify the
calculations. Using this same change of variable and applying the Weighted
Mean-Value Theorem for Integrals to the integral of the remainder term produces

$$
\begin{aligned}
\int_{t_i}^{t_{i+1}} \frac{f'''(\xi, y(\xi))}{6}\,(t - t_{i+1})(t - t_i)(t - t_{i-1})\, dt
 &= \frac{h^4}{6}\, f'''(\hat\xi, y(\hat\xi)) \int_0^1 (s-1)s(s+1)\, ds \\
 &= -\frac{h^4}{24}\, y^{(4)}(\hat\xi).
\end{aligned}
$$


In the final step, the differential equation was used to replace $f'''$ by $y^{(4)}$. Gathering
everything together and rearranging into standard form, we have

$$\frac{y_{i+1} - y_i}{h} = \frac{5}{12} f(t_{i+1}, y_{i+1}) + \frac{2}{3} f(t_i, y_i) - \frac{1}{12} f(t_{i-1}, y_{i-1}) - \frac{h^3}{24}\, y^{(4)}(\hat\xi).$$


Therefore, the two-step Adams-Moulton method is

$$\frac{w_{i+1} - w_i}{h} = \frac{5}{12} f(t_{i+1}, w_{i+1}) + \frac{2}{3} f(t_i, w_i) - \frac{1}{12} f(t_{i-1}, w_{i-1})$$

with local truncation error

$$\tau_i = -\frac{h^3}{24}\, y^{(4)}(\hat\xi) = O(h^3).$$

To implement this scheme, values are needed for both $w_0$ and $w_1$. The initial
condition is, of course, assigned to $w_0$, while $w_1$ can be obtained using any
third-order one-step method.
The three-step Adams-Moulton method is given by the difference equation

$$\frac{w_{i+1} - w_i}{h} = \frac{9}{24} f(t_{i+1}, w_{i+1}) + \frac{19}{24} f(t_i, w_i) - \frac{5}{24} f(t_{i-1}, w_{i-1}) + \frac{1}{24} f(t_{i-2}, w_{i-2}),$$

with local truncation error

$$\tau_i = -\frac{19 h^4}{720}\, y^{(5)}(\xi) = O(h^4).$$

Derivation of this method follows the same steps as the derivation of the two-step
method. The required starting values for the three-step method should be obtained
using a fourth-order one-step method.

A more substantial problem than obtaining the required number of starting
values associated with implementing any Adams-Moulton method is the fact that
the equation defining $w_{i+1}$ is implicit. We may be able to transform the equation
into explicit form, but this happens very rarely and is very problem dependent.
More than likely, some sort of rootfinding scheme (such as fixed-point iteration or
Newton's method) will need to be implemented. Of course, regardless of which
rootfinding scheme we choose, the question of convergence arises. With fixed-point
iteration, convergence will generally be obtained provided $h$ is small enough. For
those situations when we want to avoid such restrictions on $h$, which is an issue we
will discuss in Section 7.9, Newton's method and the Secant method are common
choices. But, as we know from Chapter 2, these methods may not converge.
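
As a rough illustration of the fixed-point approach, the following Python sketch applies the two-step Adams-Moulton method to a scalar initial value problem. The function names (`rk4_step`, `adams_moulton2`), the stopping tolerance, the iteration cap, and the use of an explicit Euler step as the initial guess are all assumptions made for this example; the classical fourth-order Runge-Kutta method supplies $w_1$ simply because it is at least third order.

```python
import math


def rk4_step(f, t, w, h):
    """One step of the classical fourth-order Runge-Kutta method."""
    k1 = f(t, w)
    k2 = f(t + h / 2, w + h / 2 * k1)
    k3 = f(t + h / 2, w + h / 2 * k2)
    k4 = f(t + h, w + h * k3)
    return w + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)


def adams_moulton2(f, a, b, alpha, N, tol=1e-13, max_iter=50):
    """Two-step Adams-Moulton method; the implicit equation for w_{i+1}
    is solved by fixed-point iteration (tol, max_iter are illustrative)."""
    h = (b - a) / N
    t = [a + i * h for i in range(N + 1)]
    w = [alpha, rk4_step(f, a, alpha, h)]      # w_0 and starting value w_1
    fs = [f(t[0], w[0]), f(t[1], w[1])]        # stored values f(t_j, w_j)
    for i in range(1, N):
        # Part of w_{i+1} that does not depend on the unknown
        known = w[i] + h * (8 * fs[i] - fs[i - 1]) / 12
        guess = w[i] + h * fs[i]               # Euler step as initial guess
        for _ in range(max_iter):
            new = known + 5 * h / 12 * f(t[i + 1], guess)
            done = abs(new - guess) <= tol * max(1.0, abs(new))
            guess = new
            if done:
                break
        else:
            raise RuntimeError("fixed-point iteration did not converge")
        w.append(guess)
        fs.append(f(t[i + 1], guess))
    return t, w


# Example: x' = 1 + x/t, x(1) = 1, with exact solution x(t) = t(1 + ln t)
t, w = adams_moulton2(lambda t, x: 1 + x / t, 1.0, 6.0, 1.0, 100)
print("error at t = 6:", abs(w[-1] - 6 * (1 + math.log(6))))
```

If the iteration fails to converge for a given problem, reducing $h$ or switching to Newton's method would be the natural next things to try.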

Before moving on, it is interesting to note that if we were to apply the
Adams-Moulton approach to deriving numerical methods with $m = 1$, we would obtain the
implicit one-step method

$$\frac{w_{i+1} - w_i}{h} = \frac{1}{2} \left[ f(t_{i+1}, w_{i+1}) + f(t_i, w_i) \right].$$

This scheme, which can also be derived by integrating our model initial value
problem from $t = t_i$ to $t = t_{i+1}$ and applying the trapezoidal rule to the integral of $f$, is
known as the trapezoidal method. You may recall that we encountered this method
in Section 7.4 when discussing the implementation of the Heun method.


Predictor-Corrector Schemes 


One common approach to circumventing the need to solve the implicit equation
associated with an implicit method, such as an Adams-Moulton method, is to use
an explicit method to "predict" an approximate value, $\tilde{w}_{i+1}$, and then to "correct"
$\tilde{w}_{i+1}$ to $w_{i+1}$ with the equation of the implicit method. This is the basic idea
behind a predictor-corrector scheme.

The Heun method, which was developed in Section 7.4, is an example of
a predictor-corrector scheme involving one-step methods. Recall that the Heun
method advances the approximate solution from $w_i$ to $w_{i+1}$ according to the rules

$$\frac{\tilde{w}_{i+1} - w_i}{h} = f(t_i, w_i)$$

and

$$\frac{w_{i+1} - w_i}{h} = \frac{1}{2} \left[ f(t_i, w_i) + f(t_{i+1}, \tilde{w}_{i+1}) \right].$$

These equations can be interpreted as using the explicit Euler's method to predict
$\tilde{w}_{i+1}$, followed by the implicit trapezoidal method as a corrector.

Perhaps the most popular of the predictor-corrector schemes is the Adams
fourth-order predictor-corrector method. This uses the four-step, fourth-order
Adams-Bashforth method

$$\frac{\tilde{w}_{i+1} - w_i}{h} = \frac{1}{24} \left[ 55 f(t_i, w_i) - 59 f(t_{i-1}, w_{i-1}) + 37 f(t_{i-2}, w_{i-2}) - 9 f(t_{i-3}, w_{i-3}) \right]$$

as a predictor, followed by the three-step, fourth-order Adams-Moulton method

$$\frac{w_{i+1} - w_i}{h} = \frac{1}{24} \left[ 9 f(t_{i+1}, \tilde{w}_{i+1}) + 19 f(t_i, w_i) - 5 f(t_{i-1}, w_{i-1}) + f(t_{i-2}, w_{i-2}) \right]$$

as a corrector. This scheme requires only two new function evaluations, $f(t_i, w_i)$
and $f(t_{i+1}, \tilde{w}_{i+1})$, per time step. The required starting values ($w_1$, $w_2$, and $w_3$) are
typically obtained from the classical fourth-order Runge-Kutta method.

Figure 7.10 Comparison between the Adams fourth-order predictor-corrector
(Adams PC) and the classical fourth-order Runge-Kutta method (RK4). The top
graph corresponds to the initial value problem $x' = 1 + x/t$, $x(1) = 1$, while
the bottom graph corresponds to $x' = t/x$, $x(0) = 1$.
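
The Adams fourth-order predictor-corrector scheme can be sketched in a few lines of Python. This is only a minimal illustration for a scalar problem: the function names (`rk4_step`, `adams_pc4`) are hypothetical, a single correction is applied per step, and the stored list `fs` is one possible way to arrange for exactly two new evaluations of $f$ per step.

```python
import math


def rk4_step(f, t, w, h):
    """One step of the classical fourth-order Runge-Kutta method."""
    k1 = f(t, w)
    k2 = f(t + h / 2, w + h / 2 * k1)
    k3 = f(t + h / 2, w + h / 2 * k2)
    k4 = f(t + h, w + h * k3)
    return w + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)


def adams_pc4(f, a, b, alpha, N):
    """Adams fourth-order predictor-corrector with RK4 starting values."""
    if N < 3:
        raise ValueError("need at least three steps for the starting values")
    h = (b - a) / N
    t = [a + i * h for i in range(N + 1)]
    w = [alpha]
    for i in range(3):                          # starting values w_1, w_2, w_3
        w.append(rk4_step(f, t[i], w[i], h))
    fs = [f(t[i], w[i]) for i in range(4)]      # f(t_j, w_j) for j = 0..3
    for i in range(3, N):
        # Predictor: four-step Adams-Bashforth
        pred = w[i] + h / 24 * (55 * fs[i] - 59 * fs[i - 1]
                                + 37 * fs[i - 2] - 9 * fs[i - 3])
        f_pred = f(t[i + 1], pred)              # first new evaluation
        # Corrector: three-step Adams-Moulton, using the predicted value
        corr = w[i] + h / 24 * (9 * f_pred + 19 * fs[i]
                                - 5 * fs[i - 1] + fs[i - 2])
        w.append(corr)
        fs.append(f(t[i + 1], corr))            # second new evaluation
    return t, w


# Example: x' = 1 + x/t, x(1) = 1, with exact solution x(t) = t(1 + ln t)
t, w = adams_pc4(lambda t, x: 1 + x / t, 1.0, 6.0, 1.0, 200)
print("error at t = 6:", abs(w[-1] - 6 * (1 + math.log(6))))
```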

When applied to the first of our standard initial value problems

$$\frac{dx}{dt} = 1 + \frac{x}{t} \quad (1 \le t \le 6), \qquad x(1) = 1,$$

the top graph in Figure 7.10 shows that the Adams fourth-order predictor-corrector
scheme produces errors that are roughly half an order of magnitude smaller than
the errors introduced by the classical fourth-order Runge-Kutta method (RK4). In
these calculations, the Adams method used a step size of $h = 0.025$ as compared to
$h = 0.05$ for the RK4 method so as to allow each method roughly the same number
of function evaluations. When applied to the initial value problem

$$\frac{dx}{dt} = \frac{t}{x} \quad (0 \le t \le 5), \qquad x(0) = 1,$$

however, we see in the bottom graph of Figure 7.10 that the RK4 method provides
better accuracy than the Adams predictor-corrector scheme. Once again, step sizes
were selected to allow each method the same number of function evaluations.


The Order of General Linear Multistep Methods


Since not all linear multistep methods are either Adams-Bashforth or
Adams-Moulton methods, it is worth demonstrating that the order of a linear multistep
method can be determined by checking certain conditions on the coefficients $a_j$
and $b_j$. Recall that the local truncation error for the general linear $m$-step
multistep method is

$$\tau_i = \frac{y(t_{i+1}) - \sum_{j=1}^{m} a_j\, y(t_{i+1-j})}{h} - \sum_{j=0}^{m} b_j\, y'(t_{i+1-j}),$$

where the differential equation $y' = f(t, y)$ has been used to replace each
$f(t_{i+1-j}, y(t_{i+1-j}))$ by $y'(t_{i+1-j})$. In order for the method to be of order $p$, the
truncation error must be $O(h^p)$. In light of the above expression for $\tau_i$, this requires

$$y(t_{i+1}) - \sum_{j=1}^{m} a_j\, y(t_{i+1-j}) - h \sum_{j=0}^{m} b_j\, y'(t_{i+1-j}) = O(h^{p+1}).$$

If we now expand each $y$ and $y'$ in a Taylor series about $t^* = t_{i+1-m}$, we find

$$
\begin{aligned}
y(t_{i+1}) &- \sum_{j=1}^{m} a_j\, y(t_{i+1-j}) - h \sum_{j=0}^{m} b_j\, y'(t_{i+1-j}) \\
&= \left[ 1 - \sum_{j=1}^{m} a_j \right] y(t^*)
 + \sum_{k=1}^{\infty} \frac{h^k}{k!} \left[ m^k - \sum_{j=1}^{m} a_j (m-j)^k - k \sum_{j=0}^{m} b_j (m-j)^{k-1} \right] y^{(k)}(t^*),
\end{aligned}
$$

where $(m-j)^0$ is taken to be 1 when $j = m$. For this last expression to be
$O(h^{p+1})$, the $a_j$ and $b_j$ must satisfy

$$1 = \sum_{j=1}^{m} a_j;$$

$$m^k = \sum_{j=1}^{m} a_j (m-j)^k + k \sum_{j=0}^{m} b_j (m-j)^{k-1} \quad (k = 1, 2, 3, \ldots, p);$$

and

$$m^{p+1} \ne \sum_{j=1}^{m} a_j (m-j)^{p+1} + (p+1) \sum_{j=0}^{m} b_j (m-j)^{p}.$$

Thus, by determining the value of $p$ for which the above algebraic conditions are
satisfied, we obtain the order of the linear multistep method.


EXAMPLE 7.16 The Two-Step Adams-Bashforth Method


For the two-step Adams-Bashforth method, we have

$$m = 2, \quad a_1 = 1, \quad a_2 = 0, \quad b_0 = 0, \quad b_1 = \frac{3}{2}, \quad \text{and} \quad b_2 = -\frac{1}{2}.$$

The condition $\sum_{j=1}^{m} a_j = 1$ is clearly satisfied. Starting from $k = 1$, we then find

$$
\begin{aligned}
k = 1&: \quad a_1 + b_0 + b_1 + b_2 = 2 = m \\
k = 2&: \quad a_1 + 2(2b_0 + b_1) = 4 = m^2 \\
k = 3&: \quad a_1 + 3(4b_0 + b_1) = \tfrac{11}{2} \ne m^3.
\end{aligned}
$$

Therefore, the two-step Adams-Bashforth method is, as was determined earlier, of
order 2.


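EXAMPLE 7.17 Milne's Method
The explicit four-step multistep method

$$\frac{w_{i+1} - w_{i-3}}{h} = \frac{8}{3} f(t_i, w_i) - \frac{4}{3} f(t_{i-1}, w_{i-1}) + \frac{8}{3} f(t_{i-2}, w_{i-2})$$

is known as Milne's method. For this method, we have

$$m = 4, \quad a_1 = a_2 = a_3 = 0, \quad a_4 = 1,$$

$$b_0 = 0, \quad b_1 = \frac{8}{3}, \quad b_2 = -\frac{4}{3}, \quad b_3 = \frac{8}{3}, \quad \text{and} \quad b_4 = 0.$$

As in the previous example, the condition $\sum_{j=1}^{m} a_j = 1$ is clearly satisfied. Starting
with $k = 1$, we then calculate

$$
\begin{aligned}
k = 1&: \quad b_1 + b_2 + b_3 = 4 = m \\
k = 2&: \quad 2(3b_1 + 2b_2 + b_3) = 16 = m^2 \\
k = 3&: \quad 3(9b_1 + 4b_2 + b_3) = 64 = m^3 \\
k = 4&: \quad 4(27b_1 + 8b_2 + b_3) = 256 = m^4 \\
k = 5&: \quad 5(81b_1 + 16b_2 + b_3) = \tfrac{2960}{3} \ne m^5.
\end{aligned}
$$

Hence, Milne's method is of order 4.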
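
The order conditions lend themselves to a mechanical check. The short Python sketch below tests them in exact rational arithmetic; the function name `lmm_order`, its argument layout (`a` holds $a_1, \ldots, a_m$ and `b` holds $b_0, \ldots, b_m$), and the cap `max_order` are assumptions made for this illustration.

```python
from fractions import Fraction


def lmm_order(a, b, max_order=20):
    """Return the order p of the linear multistep method with coefficients
    a = [a_1, ..., a_m] and b = [b_0, ..., b_m], found by testing the
    algebraic order conditions for k = 1, 2, 3, ...
    Returns None if sum(a_j) != 1. Pass Fractions or integers so that the
    equality tests are exact."""
    m = len(a)
    if len(b) != m + 1:
        raise ValueError("b must contain m + 1 coefficients")
    a = [Fraction(x) for x in a]
    b = [Fraction(x) for x in b]
    if sum(a) != 1:
        return None
    p = 0
    for k in range(1, max_order + 1):
        rhs = sum(a[j - 1] * Fraction(m - j) ** k for j in range(1, m + 1))
        # Python evaluates Fraction(0) ** 0 as 1, matching the convention
        # needed for the j = m term when k = 1
        rhs += k * sum(b[j] * Fraction(m - j) ** (k - 1) for j in range(m + 1))
        if rhs != Fraction(m) ** k:
            break
        p = k
    return p


F = Fraction
print(lmm_order([1, 0], [0, F(3, 2), F(-1, 2)]))                  # two-step Adams-Bashforth
print(lmm_order([0, 0, 0, 1], [0, F(8, 3), F(-4, 3), F(8, 3), 0]))  # Milne's method
```

A return value of 0 indicates that the $k = 1$ condition fails, so the method is not even first order.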


EXERCISES 

1. Derive the difference equation for the three-step Adams-Bashforth method:

$$\frac{w_{i+1} - w_i}{h} = \frac{23}{12} f(t_i, w_i) - \frac{4}{3} f(t_{i-1}, w_{i-1}) + \frac{5}{12} f(t_{i-2}, w_{i-2}).$$

Also derive the associated truncation error:

$$\tau_i = \frac{3h^3}{8}\, y^{(4)}(\xi).$$


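2. Derive the difference equation for the four-step Adams-Bashforth method:

$$\frac{w_{i+1} - w_i}{h} = \frac{55}{24} f(t_i, w_i) - \frac{59}{24} f(t_{i-1}, w_{i-1}) + \frac{37}{24} f(t_{i-2}, w_{i-2}) - \frac{9}{24} f(t_{i-3}, w_{i-3}).$$

Also derive the associated truncation error:

$$\tau_i = \frac{251 h^4}{720}\, y^{(5)}(\xi).$$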


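3. Derive the difference equation for the three-step Adams-Moulton method:

$$\frac{w_{i+1} - w_i}{h} = \frac{9}{24} f(t_{i+1}, w_{i+1}) + \frac{19}{24} f(t_i, w_i) - \frac{5}{24} f(t_{i-1}, w_{i-1}) + \frac{1}{24} f(t_{i-2}, w_{i-2}).$$

Also derive the associated truncation error:

$$\tau_i = -\frac{19 h^4}{720}\, y^{(5)}(\xi).$$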


4. Derive the difference equation for the trapezoidal method (i.e., the
Adams-Moulton method with $m = 1$):

$$\frac{w_{i+1} - w_i}{h} = \frac{1}{2} \left[ f(t_{i+1}, w_{i+1}) + f(t_i, w_i) \right].$$

What is the truncation error term associated with this method?


5. Apply the two-step Adams-Bashforth method to approximate the solution of the
given initial value problem over the indicated interval in $t$ using the indicated
number of time steps. Use any second-order one-step method to determine $w_1$.
(a) e =te®-2 (0<t<1), 2(0)=1, N=4 

(b) 2 +(4/2est* (<t<3), 2()=1, N=5 

(c) 2 =(sinz—e*\/eosx (0<t<1), 2(0)=0, N=3 

(d) ce =(1+2°%)/t (1<t<4), a(1)=0, N=5 

(e) of =t?-227-1 (0<t<1), 2(0)=0, N=4 

(f) 2 =2(1-a)/(#sinz) (1<t<2), 2(l)=2, N=3 


6. Repeat Exercise 5 using the four-step Adams-Bashforth method, but take $N = 6$
in each case. Use the classical fourth-order Runge-Kutta method to calculate
$w_1$, $w_2$, and $w_3$.

7. Repeat Exercise 5 using the Adams fourth-order predictor-corrector method, but
take $N = 6$ in each case. Use the classical fourth-order Runge-Kutta method to
calculate $w_1$, $w_2$, and $w_3$.


8. In the last worked example of the section, we determined that Milne's method is
of order 4. Determine the complete form of the truncation error term for Milne's
method.


9. Determine the order and the complete form of the truncation error term for the
indicated linear multistep method.
(a) $w_{i+1} - w_{i-1} = \frac{h}{3} \left[ f(t_{i+1}, w_{i+1}) + 4 f(t_i, w_i) + f(t_{i-1}, w_{i-1}) \right]$
(b) $w_{i+1} - 4w_i + 3w_{i-1} = -2h f(t_{i-1}, w_{i-1})$
(c) $w_{i+1} - w_{i-1} = 2h f(t_i, w_i)$
(d) $w_{i+1} - w_{i-2} = \frac{3h}{8} \left[ f(t_{i+1}, w_{i+1}) + 3 f(t_i, w_i) + 3 f(t_{i-1}, w_{i-1}) + f(t_{i-2}, w_{i-2}) \right]$
(e) weer — gust Swi = BAS (teg1, wit) 
(fF) wir — ws — Zea =A [f (tog, wie) — Ef (Es, we) + 2 f(te-1, wi-1)] 
10. Use each of the following initial value problems to demonstrate that the global
error associated with the two-step Adams-Bashforth method is $O(h^2)$.

(a) $x' = 4t\sqrt{x^2 + 1}/x$ $(0 \le t \le 5)$, $x(0) = 1$, $x(t) = \sqrt{(2t^2 + \sqrt{2})^2 - 1}$
(b) $x' = 1 + x/t$ $(1 \le t \le 6)$, $x(1) = 1$, $x(t) = t(1 + \ln t)$

(c) $x' = t/x$ $(0 \le t \le 5)$, $x(0) = 1$, $x(t) = \sqrt{t^2 + 1}$

(d) 2’ =2(2+1)/e (<t<2), 2(0)=-2 

(e) e =(-tete)/(tz) (1<t<2), 2(l)=2 


11. Numerically demonstrate that the global error of the four-step Adams-Bashforth
method is $O(h^4)$ using the initial value problems from Exercise 10.


12. Numerically demonstrate that the global error of the Adams fourth-order
predictor-corrector method is $O(h^4)$ using the initial value problems from Exercise 10.


13. Use each of the following initial value problems to compare the performance
of the two-step Adams-Bashforth method with the second-order Taylor method.
Remember to adjust step sizes so that each method uses roughly the same number
of function evaluations.
(a) $x' = 1 - x + e^{2t} x^2$ $(0 \le t \le 0.9)$, $x(0) = 0$, $x(t) = e^{-t} \tan(e^t - 1)$
(b) $x' = (t x^2 - x)/t$ $(1 \le t \le 5)$, $x(1) = -1/\ln 2$, $x(t) = -1/(t \ln(2t))$
(c) $x' = -(1 + t + t^2) - (2t + 1)x - x^2$ $(0 \le t \le 3)$, $x(0) = -1/2$,
$x(t) = -t - 1/(e^t + 1)$
(d) $x' = x/t + t e^t$ $(1 \le t \le 2)$, $x(1) = 0$, $x(t) = t(e^t - e)$

14. Compare the performance of the two-step Adams-Bashforth method with the
modified Euler method, Heun's method and the optimal RK2 method. Use the
initial value problems from Exercise 13, and remember to adjust step sizes so
that each method uses roughly the same number of function evaluations.

15. Compare the performance of the four-step Adams-Bashforth method with the
classical fourth-order Runge-Kutta method. Use the initial value problems from
Exercise 13, and remember to adjust step sizes so that each method uses roughly
the same number of function evaluations.


16. Compare the performance of the four-step Adams-Bashforth method with the
fourth-order Taylor method. Use the initial value problems from Exercise 13,
and remember to adjust step sizes so that each method uses roughly the same
number of function evaluations.

17. Compare the performance of the Adams fourth-order predictor-corrector method
with the classical fourth-order Runge-Kutta method. Use the initial value
problems from Exercise 13, and remember to adjust step sizes so that each method
uses roughly the same number of function evaluations.


18. (a) Consider the initial value problem

$$x' = -(1 + t + t^2) - (2t + 1)x - x^2 \quad (0 \le t \le 3), \qquad x(0) = -\frac{1}{2}.$$

The exact solution of this problem is $x(t) = -t - 1/(e^t + 1)$. With what rate
does the two-step Adams-Bashforth method converge to this exact solution?

(b) Repeat part (a), but change the initial condition to $x(0) = -1$. The exact
solution in this case is $x(t) = -t - 1$.

(c) Explain any difference in rate of convergence between parts (a) and (b).

19. Repeat Exercise 18 using the four-step Adams-Bashforth method.

20. Repeat Exercise 18 using the Adams fourth-order predictor-corrector method.


21. (a) Consider the initial value problem
Sta SST Sb) ele 2: 


The exact solution of this problem is x(t) = t? + 1/t*. With what rate does 
the four-step Adams-Bashforth method converge to this exact solution? 
(b) Repeat part (a), but change the initial condition to z(1) = 1. The exact 
solution in this case is 2(t) = 2°. 
(c) Explain any difference in rate of convergence between parts (a) and (b). 
(d) Repeat parts (a), (b), and (c) using the Adams fourth-order predictor- 
corrector method. 


22. In the "Projection Printing" problem capsule of the Overview to Chapter 1
(see page 6), the following initial value problem for the normalized photoactive
compound concentration, $M$, inside a resist film after exposure was developed:

$$\frac{dM(z, t_{exp})}{dz} = M(z, t_{exp}) \left[ A(1 - M(z, t_{exp})) - B \ln M(z, t_{exp}) \right]$$

$$M(0, t_{exp}) = \exp(-I_0 C t_{exp}).$$

In this problem $z$ denotes depth into the resist film, $A$, $B$, and $C$ are properties
of the resist material, and the product $I_0 t_{exp}$ is the exposure energy of the light
source used during the illumination phase. For the resist material AZ2400, it has
been determined that $A = 0.162/\mu\text{m}$, $B = 0.184/\mu\text{m}$, and $C = 0.0128\ \text{cm}^2/\text{mJ}$.
If the exposure energy is 110 mJ/cm$^2$, what is the normalized photoactive
compound concentration as a function of depth? Tabulate $M$ for $z$ ranging from 0
$\mu$m through 1 $\mu$m, in increments of 0.05 $\mu$m.

23. A common method for improving the performance of projection printing in
optical microlithography is to place a layer of film, known as a contrast enhancing
layer (CEL), on top of the photoresist material. The following initial value
problem serves as a model for the concentration of photoactive compound within the
CEL, $M_c$:

$$\frac{dM_c}{dz} = M_c \left[ \frac{\alpha}{2} (1 - M_c^2) + \beta (1 - M_c) - B_c \ln M_c \right]$$

$$M_c(0) = \exp(-I C_c t).$$

Here, $\alpha = 1.4 A_c$; $\beta = -0.06 A_c$; $A_c$, $B_c$, and $C_c$ are material properties of the
CEL material; the product $It$ is the exposure energy of the light source; and $z$
denotes depth into the CEL film. For a particular polysilane CEL, it has been
determined that $A_c = 8.93/\mu\text{m}$, $B_c = 0.175/\mu\text{m}$, and $C_c = 0.0376\ \text{cm}^2/\text{mJ}$.
If the exposure energy is 110 mJ/cm$^2$, tabulate $M_c$ for $z$ ranging from 0 $\mu$m
through 0.2 $\mu$m, in increments of 0.005 $\mu$m.


7.6 CONVERGENCE AND STABILITY ANALYSIS 


Earlier in the chapter it was noted that the analysis of numerical methods for 
initial value problems focuses on two types of error, local truncation error and 
global discretization error, and on the three important properties of consistency, 
convergence and stability. When we investigated Euler’s method in Section 7.2, the 
simplicity of the difference equation allowed us to perform a complete analysis of 
the method. In subsequent sections, however, the difference equations became more 
complicated. Rather than perform a complete analysis of each of these methods, we 
dealt almost exclusively with the local truncation error. Global error was treated 
only with numerical experiments, and stability was ignored entirely. In this section, 
we will remedy this situation by developing a general framework for convergence and 
stability analysis of both one-step and linear multistep methods. For completeness, 
consistency will also be included in our discussions. 


Recalling the Key Concepts 


Since we've covered a lot of material since Section 7.1, it is probably worthwhile to
review the basic notation we will use and the central concepts we will discuss. We
seek to approximate the solution of the initial value problem

$$y' = f(t, y), \qquad y(a) = \alpha,$$

over the interval $a \le t \le b$. For simplicity, the approximate solution is obtained at
equally spaced points; that is, for some positive integer $N$, we define the step size
$h = (b - a)/N$ and then set $t_i = a + ih$ for each $i = 0, 1, 2, \ldots, N$. The value of the
exact solution of the initial value problem at $t = t_i$, $y(t_i)$, is denoted by $y_i$, and the
approximation to $y_i$ is denoted by $w_i$.

Local truncation error, $\tau_i$, measures the amount by which the solution of
the differential equation fails to satisfy the difference equation that defines the
numerical method. If $\tau_i \to 0$ as $h \to 0$, then the method is called consistent.
Additionally, if $\tau_i = O(h^p)$ for some constant $p$, the method is said to be of order $p$.

Global discretization error measures the difference between the solution of the
differential equation and the solution of the difference equation, $y_i - w_i$. Suppose
we let $N$ tend toward infinity [note that this will force the step size $h = (b - a)/N$
toward zero], keeping $t_N = b$ regardless of the value of $N$. If it follows that

$$\lim_{N \to \infty} \max_{1 \le i \le N} |y_i - w_i| = 0,$$

we say the method is convergent.

Stability deals with sensitivity of the solution of the difference equations to
changes in initial conditions. For a given numerical method and a fixed $N$, let $\{w_i\}$
denote the sequence of values generated from an initial condition of $\alpha$ and $\{\tilde{w}_i\}$
denote the sequence generated from an initial condition of $\tilde{\alpha}$. The numerical method
is stable if and only if there exists a function $k(t) > 0$ such that

$$|\tilde{w}_i - w_i| \le k(t_i)\, |\tilde{\alpha} - \alpha|$$

for all $i$. The key point is that the function $k(t_i)$ must be independent of $h$.

Of course, the most important property for a numerical method to have is 
convergence. If a method does not converge (i.e., if the solution of the difference 
equation does not tend toward the solution of the differential equation), then the 
method is useless. Unfortunately, direct proofs of convergence can be hard to 
obtain, whereas checks for consistency and stability tend to be much easier. In 
what follows, we therefore want to determine whether we can link the consistency 
and stability of a method with convergence of that method. 


One-Step Methods
Analysis for the one-step method

$$\frac{w_{i+1} - w_i}{h} = \phi(f, t_i, w_i, h) \qquad (1)$$

is rather straightforward. We start with consistency. The local truncation error
associated with (1) is

$$\tau_i = \frac{y_{i+1} - y_i}{h} - \phi(f, t_i, y_i, h).$$

If we take the limit as $h \to 0$ of this last equation and assume that $\phi$ is continuous
in $h$, we obtain

$$\lim_{h \to 0} \tau_i = y'(t_i) - \phi(f, t_i, y_i, 0) = f(t_i, y_i) - \phi(f, t_i, y_i, 0).$$

Therefore, the method is consistent provided

$$f(t_i, y_i) = \phi(f, t_i, y_i, 0),$$

which is a simple condition to verify.


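EXAMPLE 7.18 Consistency of Modified Euler Method
Recall that the difference equation for the modified Euler method is

$$\frac{w_{i+1} - w_i}{h} = f\left( t_i + \frac{h}{2},\; w_i + \frac{h}{2} f(t_i, w_i) \right);$$

hence,

$$\phi(f, t_i, w_i, h) = f\left( t_i + \frac{h}{2},\; w_i + \frac{h}{2} f(t_i, w_i) \right).$$

It then follows that

$$\phi(f, t_i, y_i, 0) = f\left( t_i + 0,\; y_i + 0 \cdot f(t_i, y_i) \right) = f(t_i, y_i),$$

so the modified Euler method is consistent.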


The following theorem provides the conditions for a one-step method to be
stable, as well as for a one-step method to be convergent. Note the role of the
Lipschitz condition in this theorem, and compare the hypotheses of this theorem
with those for a well-posed initial value problem.


Theorem. Suppose the one-step method given in (1) is applied to the initial
value problem $y' = f(t, y)$, $y(a) = \alpha$, where $h = (b - a)/N$ for some positive
integer $N$ and $t_i = a + ih$. Let $D = \{(t, y) \mid a \le t \le b,\ y \in \mathbb{R}\}$, and
$D_\phi = \{(f, t, y, h) \mid f(t, y) \text{ is Lipschitz in } y \text{ on } D,\ (t, y) \in D,\ 0 \le h \le h_0\}$ for some $h_0$.
If $\phi(f, t, y, h)$ satisfies a Lipschitz condition in $y$ on $D_\phi$ with Lipschitz constant
$L_\phi$, then

1. the one-step method is stable; and

2. if there exist positive constants $c$ and $p$ such that $|\tau_i| \le c h^p$ for each
$i = 1, 2, 3, \ldots, N$ whenever $0 < h \le h_0$, then

$$|y_i - w_i| \le \frac{c h^p}{L_\phi} \left[ e^{L_\phi (t_i - a)} - 1 \right].$$
¢ 


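Proof. (1) Let $\{w_i\}$ and $\{\tilde{w}_i\}$ be numerical solutions generated by (1) from
initial conditions of $\alpha$ and $\tilde{\alpha}$, respectively. Then

$$\left| \frac{\tilde{w}_{i+1} - \tilde{w}_i}{h} - \frac{w_{i+1} - w_i}{h} \right| = \left| \phi(f, t_i, \tilde{w}_i, h) - \phi(f, t_i, w_i, h) \right| \le L_\phi\, |\tilde{w}_i - w_i|.$$

After some algebraic manipulation, this yields

$$|\tilde{w}_{i+1} - w_{i+1}| \le (1 + h L_\phi)\, |\tilde{w}_i - w_i|.$$

Next, apply this last formula recursively to obtain

$$|\tilde{w}_{i+1} - w_{i+1}| \le (1 + h L_\phi)^{i+1}\, |\tilde{w}_0 - w_0|.$$

But $\tilde{w}_0 = \tilde{\alpha}$, $w_0 = \alpha$, and $0 \le 1 + h L_\phi \le e^{h L_\phi}$, so

$$|\tilde{w}_{i+1} - w_{i+1}| \le e^{(i+1) h L_\phi}\, |\tilde{\alpha} - \alpha| = e^{L_\phi (t_{i+1} - a)}\, |\tilde{\alpha} - \alpha|,$$

since $(i + 1)h = t_{i+1} - a$. Thus $k(t) = e^{L_\phi (t - a)}$, which is independent of $h$,
and the method is stable.

(2) The proof of this part is similar to that of part (1) and will be left as an
exercise. □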


The second part of this theorem tells us two important things. First, if a
one-step method is consistent, then it is also convergent. Therefore, we can guar- 
antee that a particular one-step method converges by showing that the method is 
consistent and that the function ¢ satisfies a Lipschitz condition. Second, if the 
truncation error is of order p, then the global error will also be of order p. This 
provides theoretical justification for the conclusions we drew from the results of 
our numerical investigations with Taylor and Runge-Kutta methods in Sections 7.3 
and 7.4. 

The connection between consistency and convergence for one-step methods is
actually stronger than is suggested by this theorem. Under the same hypothesis
on $\phi$, it can be shown (see Gear [1] or Henrici [2] for details) that convergence implies
consistency. Hence, a one-step method converges if and only if it is consistent.


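EXAMPLE 7.19 Convergence and Stability for Modified Euler Method
For the modified Euler method

$$\phi(f, t, w, h) = f\left( t + \frac{h}{2},\; w + \frac{h}{2} f(t, w) \right).$$

Suppose that $f$ satisfies a Lipschitz condition in $y$ on the set $D = \{(t, y) \mid a \le t \le b,\ y \in \mathbb{R}\}$
with Lipschitz constant $L$. Then

$$
\begin{aligned}
|\phi(f, t, y_1, h) - \phi(f, t, y_2, h)|
 &= \left| f\left( t + \frac{h}{2},\; y_1 + \frac{h}{2} f(t, y_1) \right) - f\left( t + \frac{h}{2},\; y_2 + \frac{h}{2} f(t, y_2) \right) \right| \\
 &\le L \left| y_1 + \frac{h}{2} f(t, y_1) - y_2 - \frac{h}{2} f(t, y_2) \right| \\
 &\le L\, |y_1 - y_2| + \frac{hL}{2}\, |f(t, y_1) - f(t, y_2)| \\
 &\le L\, |y_1 - y_2| + \frac{hL^2}{2}\, |y_1 - y_2| = \left( L + \frac{hL^2}{2} \right) |y_1 - y_2|.
\end{aligned}
$$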

Therefore, $\phi$ satisfies a Lipschitz condition in $y$ on the set

$$\{(f, t, y, h) \mid f(t, y) \text{ is Lipschitz in } y \text{ on } D,\ (t, y) \in D,\ 0 \le h \le h_0\}$$

for any $h_0 > 0$ with Lipschitz constant

$$L_\phi = L + \frac{h_0 L^2}{2}.$$

Hence, we may conclude that the modified Euler method is stable. Since we
have already established that the modified Euler method is consistent, we may also
conclude that the method is convergent. Furthermore, because the local truncation
error for this method is $O(h^2)$, it follows that the global error is also $O(h^2)$.
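
The conclusion that the global error of the modified Euler method is $O(h^2)$ can be checked numerically. The Python sketch below is one way to do so; the helper names (`modified_euler`, `observed_order`) and the choice of test problem ($x' = 1 + x/t$, $x(1) = 1$, with exact solution $x(t) = t(1 + \ln t)$) are assumptions made for this illustration. Halving the step size should reduce the error at $t = b$ by a factor of roughly 4, so the base-2 logarithm of the error ratio should be close to 2.

```python
import math


def modified_euler(f, a, b, alpha, N):
    """Approximate y(b) for y' = f(t, y), y(a) = alpha using N steps
    of the modified Euler method."""
    h = (b - a) / N
    w = alpha
    for i in range(N):
        t = a + i * h
        w = w + h * f(t + h / 2, w + h / 2 * f(t, w))
    return w


def observed_order(f, a, b, alpha, exact_b, N):
    """Estimate the order of convergence from the errors at t = b
    obtained with N and 2N steps."""
    e1 = abs(modified_euler(f, a, b, alpha, N) - exact_b)
    e2 = abs(modified_euler(f, a, b, alpha, 2 * N) - exact_b)
    return math.log2(e1 / e2)


f = lambda t, x: 1 + x / t
exact_b = 6 * (1 + math.log(6))
for N in (50, 100, 200):
    print(N, observed_order(f, 1.0, 6.0, 1.0, exact_b, N))
```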


Linear Multistep Methods 


Consider the general linear $m$-step multistep method

$$
\begin{aligned}
&\frac{w_{i+1} - a_1 w_i - a_2 w_{i-1} - \cdots - a_m w_{i+1-m}}{h} = \\
&\qquad b_0 f(t_{i+1}, w_{i+1}) + b_1 f(t_i, w_i) + b_2 f(t_{i-1}, w_{i-1}) + \cdots + b_m f(t_{i+1-m}, w_{i+1-m}). \qquad (2)
\end{aligned}
$$

Before we turn to a development of the conditions required for (2) to be consistent,
stable and convergent, there is a technical detail that must be addressed. Recall
that the values for $w_1, w_2, w_3, \ldots, w_{m-1}$ are obtained from a different numerical
method. We cannot reasonably expect the values that are then obtained from (2)
to converge to the solution of the initial value problem if these starting values don't
converge. For each $i = 1, 2, 3, \ldots, m - 1$, let $\epsilon_i$ denote the error associated with the
starting value $w_i$. Throughout the remainder of our analysis, we will assume that
there exists a function $\epsilon(h)$ such that $|\epsilon_i| \le \epsilon(h)$ for each $i = 1, 2, 3, \ldots, m - 1$ and
such that $\epsilon(h) \to 0$ as $h \to 0$.

For (2) to be consistent, we must have $\tau_i \to 0$ as $h \to 0$. This requires that
the local truncation error be at least $O(h)$; that is, the method must be at least
first order. From our work at the end of Section 7.5, the coefficients $a_j$ and $b_j$ must
satisfy

$$\sum_{j=1}^{m} a_j = 1 \quad \text{and} \quad \sum_{j=1}^{m} a_j (m - j) + \sum_{j=0}^{m} b_j = m$$

for the method to be at least first order. By using the first of these relations, the
second relation can be simplified to $-\sum_{j=1}^{m} j a_j + \sum_{j=0}^{m} b_j = 0$. Putting all of this
together, it follows that the consistency conditions for linear multistep methods are
given by

$$\sum_{j=1}^{m} a_j = 1 \quad \text{and} \quad -\sum_{j=1}^{m} j a_j + \sum_{j=0}^{m} b_j = 0. \qquad (3)$$


EXAMPLE 7.20 The Leapfrog Method
The explicit two-step method

$$\frac{w_{i+1} - w_{i-1}}{h} = 2 f(t_i, w_i)$$

is known as the leapfrog method, or the midpoint method. Here, we have

$$m = 2, \quad a_1 = 0, \quad a_2 = 1, \quad b_0 = 0, \quad b_1 = 2, \quad \text{and} \quad b_2 = 0.$$

Substituting these values into the consistency conditions (3), we find

$$a_1 + a_2 = 0 + 1 = 1; \quad \text{and}$$

$$-(a_1 + 2a_2) + b_0 + b_1 + b_2 = -(0 + 2) + 0 + 2 + 0 = 0.$$

Therefore, the leapfrog method is consistent.


Let's now turn our attention to stability. For the moment, suppose that
$f(t, y) \equiv 0$. With this assumption, equation (2) becomes

$$w_{i+1} - a_1 w_i - a_2 w_{i-1} - \cdots - a_m w_{i+1-m} = 0.$$

Let $\{w_i\}$ and $\{\tilde{w}_i\}$ denote solutions obtained from this simplified difference equation
for different initial conditions. Subtract the equation for the $w_i$ from the equation
for the $\tilde{w}_i$, and define $v_i = \tilde{w}_i - w_i$. It follows that $v_i$ satisfies the difference equation

$$v_{i+1} - a_1 v_i - a_2 v_{i-1} - \cdots - a_m v_{i+1-m} = 0. \qquad (4)$$

Associated with (4) is the characteristic polynomial of the multistep method

$$P(\lambda) = \lambda^m - a_1 \lambda^{m-1} - a_2 \lambda^{m-2} - \cdots - a_m.$$

If $\lambda$ is a zero of $P$, then $v_i = \lambda^i$ is a solution of (4) since

$$\lambda^{i+1} - a_1 \lambda^i - a_2 \lambda^{i-1} - \cdots - a_m \lambda^{i+1-m} = \lambda^{i+1-m} P(\lambda) = 0.$$

Now, suppose that $|\lambda| > 1$. This implies that $|v_i| = |\lambda|^i$ is unbounded as $i \to \infty$.
In other words, the difference between $\tilde{w}_i$ and $w_i$ is unbounded, and the method
is unstable. Alternatively, suppose that $|\lambda| = 1$ but that $\lambda$ has multiplicity greater
than one. The expression $v_i = i\lambda^i$ is then a solution of (4). As $i \to \infty$,
$|v_i| = i|\lambda|^i = i$ is unbounded and the method is again unstable. These results motivate
the following definition.


Definition. The linear multistep method given by (2) satisfies the ROOT
CONDITION if the zeros of the characteristic polynomial associated with the
method,

$$P(\lambda) = \lambda^m - a_1 \lambda^{m-1} - a_2 \lambda^{m-2} - \cdots - a_m,$$

satisfy

1. $|\lambda| \le 1$ for all zeros $\lambda$; and
2. if $|\lambda| = 1$, then $\lambda$ is a simple zero.


We have essentially just established that when f(t,y) = 0, if a method does 
not satisfy the root condition, then the method is not stable. Equivalently, if a 
method is stable, then it must satisfy the root condition. It can be shown (see 
Isaacson and Keller [3]) that the same result holds when f (t,y) is not identically 
zero and that the root condition implies stability, whether f(t, y) is identically zero 
or not. Hence, we have the following result. 


Theorem. A linear multistep method is stable if and only if it satisfies the 
root condition. 


Note that $P(1) = 1 - a_1 - a_2 - \cdots - a_m$. Using the first of the relations
in (3), we see that for a consistent linear multistep method $P(1) = 0$. Hence, a
consistent and stable linear multistep method will always have at least one zero
with magnitude equal to 1. We subclassify the stability of a method based on
whether or not there are other zeros of unit magnitude.


Definition. A stable multistep method is said to be STRONGLY STABLE if
$\lambda = 1$ is the only zero of $P(\lambda)$ with $|\lambda| = 1$ and is said to be WEAKLY STABLE
otherwise.


We will explore the consequences of weak stability below and in the exercises. 


EXAMPLE 7.21 Stability of Adams-Bashforth and Adams-Moulton
Methods


Regardless of the number of steps, $m$, all Adams-Bashforth and Adams-Moulton
methods have $a_1 = 1$ and $a_j = 0$ for each $j = 2, 3, 4, \ldots, m$. Thus, for each $m$, the
characteristic polynomial associated with one of these methods is of the form

$$P(\lambda) = \lambda^m - \lambda^{m-1} = \lambda^{m-1}(\lambda - 1).$$

The zeros of this polynomial are $\lambda = 0$ (of multiplicity $m - 1$) and $\lambda = 1$.
Consequently, Adams-Bashforth and Adams-Moulton methods of all orders are strongly
stable.


EXAMPLE 7.22 Stability of Leapfrog Method
For the leapfrog method, we know that $m = 2$, $a_1 = 0$ and $a_2 = 1$. The
characteristic polynomial for this method is then

$$P(\lambda) = \lambda^2 - 1 = (\lambda - 1)(\lambda + 1).$$

The zeros of $P(\lambda)$ are clearly 1 and $-1$, which implies that the leapfrog method is
weakly stable.
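
For two-step methods the characteristic polynomial is a quadratic, so the root condition can be checked directly with the quadratic formula. The Python sketch below does this; the function name `classify_two_step` and the tolerance `tol` are assumptions made for this illustration, and the floating-point comparisons are only as reliable as that tolerance allows (nearly repeated zeros are especially delicate).

```python
import cmath


def classify_two_step(a1, a2, tol=1e-9):
    """Classify a two-step method by the zeros of its characteristic
    polynomial P(lambda) = lambda^2 - a1*lambda - a2."""
    disc = cmath.sqrt(a1 * a1 + 4 * a2)
    roots = [(a1 + disc) / 2, (a1 - disc) / 2]
    # Root condition, part 1: no zero may lie outside the unit circle
    if any(abs(r) > 1 + tol for r in roots):
        return "unstable"
    on_circle = [r for r in roots if abs(abs(r) - 1) <= tol]
    # Root condition, part 2: zeros on the unit circle must be simple
    if len(on_circle) == 2 and abs(roots[0] - roots[1]) <= tol:
        return "unstable"
    # For a consistent method lambda = 1 is always a zero; the method is
    # strongly stable when no other zero has unit magnitude
    if all(abs(r - 1) <= tol for r in on_circle):
        return "strongly stable"
    return "weakly stable"


print(classify_two_step(1, 0))   # a_1 = 1, a_2 = 0 (two-step Adams methods)
print(classify_two_step(0, 1))   # a_1 = 0, a_2 = 1 (leapfrog method)
```

For methods with more than two steps, a general polynomial rootfinder would be needed in place of the quadratic formula.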

Finally, let's tackle the question of convergence for linear multistep methods.
Suppose (2) is convergent, and consider the initial value problem $y' = 0$, $y(0) = 1$,
whose exact solution is $y(t) = 1$. Applying (2) to this initial value problem, the
resulting difference equation is

$$w_{i+1} - \sum_{j=1}^{m} a_j w_{i+1-j} = 0. \qquad (5)$$

Since the solution of the initial value problem is constant, we seek a constant
solution of (5). Substituting $w_i = c$, for some constant $c$, into (5) yields

$$c \left( 1 - \sum_{j=1}^{m} a_j \right) = 0.$$

If $\sum_{j=1}^{m} a_j \ne 1$, the only constant solution of (5) is $w_i = 0$. This, however, violates
the assumption that the method is convergent, so we must have $\sum_{j=1}^{m} a_j = 1$.

Continue to suppose that (2) is convergent. Consider the initial value problem
$y' = 1$, $y(0) = 0$, which we will solve over the interval $0 \le t \le T$ for some fixed $T$.
Furthermore, we will always select the step size, $h$, and the number of time steps, $N$,
so that $Nh = T$. The exact solution in this case is $y(t) = t$, and the resulting
difference equation is

$$w_{i+1} - \sum_{j=1}^{m} a_j w_{i+1-j} = h \sum_{j=0}^{m} b_j. \qquad (6)$$

Since the solution of the initial value problem is linear in the independent variable,
we seek a solution of (6) in the form $w_i = ci$ for some constant $c$. After some
algebraic manipulation, and using our previous result that $\sum_{j=1}^{m} a_j = 1$, we find
that

$$c \sum_{j=1}^{m} j a_j = h \sum_{j=0}^{m} b_j.$$

For a convergent method, it can be shown that $\sum_{j=1}^{m} j a_j \ne 0$ (Exercise 14).
Therefore,

$$c = h\, \frac{\sum_{j=0}^{m} b_j}{\sum_{j=1}^{m} j a_j} \quad \text{and, in particular,} \quad w_N = Nh\, \frac{\sum_{j=0}^{m} b_j}{\sum_{j=1}^{m} j a_j} = T\, \frac{\sum_{j=0}^{m} b_j}{\sum_{j=1}^{m} j a_j}.$$

Now, because the method is convergent,

$$T = y(T) = \lim_{N \to \infty} w_N = T\, \frac{\sum_{j=0}^{m} b_j}{\sum_{j=1}^{m} j a_j},$$

which implies that $\sum_{j=0}^{m} b_j \big/ \sum_{j=1}^{m} j a_j = 1$, or, equivalently, that
$-\sum_{j=1}^{m} j a_j + \sum_{j=0}^{m} b_j = 0$.

We have just shown that if a linear multistep method is convergent, then the
coefficients $a_j$ and $b_j$ must satisfy

$$\sum_{j=1}^{m} a_j = 1 \quad \text{and} \quad -\sum_{j=1}^{m} j a_j + \sum_{j=0}^{m} b_j = 0.$$

These are precisely the consistency conditions we stated in equation (3). Therefore,
a convergent linear multistep method must be consistent. By considering the initial
value problem $y' = 0$, $y(0) = 0$, we can show that a convergent linear multistep
method must satisfy the root condition (see Exercise 13), and hence must be stable.
Therefore, convergence implies both consistency and stability. It can also be shown
(see Henrici [2] or Isaacson and Keller [3]) that consistency and stability together
imply convergence. Our final result is then


Theorem. A linear multistep method is convergent if and only if it is both 
consistent and stable. 
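
The two halves of the theorem can be checked mechanically for a given method. The following is a minimal sketch, assuming the method is written as $w_{i+1} = \sum_{j=1}^{m} a_j w_{i+1-j} + h \sum_{j=0}^{m} b_j f(t_{i+1-j}, w_{i+1-j})$, so that its characteristic polynomial is $\lambda^m - \sum_{j=1}^{m} a_j \lambda^{m-j}$. The function names are illustrative only, and the root condition check is restricted to two-step methods so that the quadratic formula suffices.

```python
from fractions import Fraction
import cmath


def is_consistent(a, b):
    """Check sum(a_j) = 1 and -sum(j*a_j) + sum(b_j) = 0.

    a = [a_1, ..., a_m], b = [b_0, ..., b_m] (assumed coefficient layout).
    """
    a = [Fraction(x) for x in a]
    b = [Fraction(x) for x in b]
    first = sum(a) == 1
    second = -sum(j * aj for j, aj in enumerate(a, start=1)) + sum(b) == 0
    return first and second


def root_condition_two_step(a1, a2, tol=1e-9):
    """Root condition for lambda^2 - a1*lambda - a2 = 0 (two-step methods only)."""
    disc = cmath.sqrt(a1 * a1 + 4 * a2)
    r1, r2 = (a1 + disc) / 2, (a1 - disc) / 2
    if max(abs(r1), abs(r2)) > 1 + tol:
        return False  # a root lies outside the unit circle
    # a repeated root of modulus one also violates the root condition
    both_on_circle = abs(abs(r1) - 1) <= tol and abs(abs(r2) - 1) <= tol
    return not (both_on_circle and abs(r1 - r2) <= tol)


# Leapfrog: w_{i+1} = w_{i-1} + 2h f(t_i, w_i)
print(is_consistent([0, 1], [0, 2, 0]), root_condition_two_step(0, 1))
# Two-step Adams-Bashforth: w_{i+1} = w_i + h[(3/2) f_i - (1/2) f_{i-1}]
print(is_consistent([1, 0], [0, Fraction(3, 2), Fraction(-1, 2)]),
      root_condition_two_step(1, 0))
```

Both methods pass both checks, so both are convergent. The check does not distinguish weak from strong stability; for the leapfrog method the second root sits on the unit circle at $\lambda = -1$.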


EXAMPLE 7.23 Convergent but Weakly Stable versus Convergent and 
Strongly Stable 


In previous examples we have established that the leapfrog method is consistent
and stable, though only weakly stable. It follows that the leapfrog method is
convergent. To demonstrate the consequences of the weak stability of this method,
we will compare the leapfrog method with the strongly stable two-step
Adams-Bashforth method. We have chosen this strongly stable method because both the
leapfrog method and the two-step Adams-Bashforth method are second order.
The test problem is

$$\frac{dx}{dt} + x = \sin t, \qquad x(0) = 1.$$

Figure 7.11 displays the approximate solutions for this problem, leapfrog method in
the top graph and Adams-Bashforth method in the bottom graph, over the interval
$0 \le t \le 10$ using $N = 100$ time steps. Observe the "saw-tooth" oscillations in
the leapfrog method solution that appear around $t = 7$ and grow in amplitude for
increasing $t$. This phenomenon is a result of the weak stability of the numerical
method. Reducing the step size will not eliminate this problem; it will only delay
its onset (see Exercise 15). 
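
The comparison in this example can be reproduced with a few lines of code. The sketch below is not the program used to produce Figure 7.11; it assumes that the second starting value $w_1$ is taken from the closed-form solution $x(t) = \frac{3}{2} e^{-t} + \frac{1}{2}(\sin t - \cos t)$, and the function names are illustrative.

```python
import math


def f(t, x):
    # x' + x = sin t rewritten as x' = f(t, x)
    return math.sin(t) - x


def exact(t):
    # closed-form solution of x' + x = sin t, x(0) = 1
    return 1.5 * math.exp(-t) + 0.5 * (math.sin(t) - math.cos(t))


def leapfrog(h, n):
    w = [exact(0.0), exact(h)]  # assumption: exact value used for w_1
    for i in range(1, n):
        w.append(w[i - 1] + 2 * h * f(i * h, w[i]))
    return w


def adams_bashforth2(h, n):
    w = [exact(0.0), exact(h)]  # same starting values as above
    for i in range(1, n):
        w.append(w[i] + h * (1.5 * f(i * h, w[i])
                             - 0.5 * f((i - 1) * h, w[i - 1])))
    return w


h, n = 0.1, 100
err_lf = [w - exact(i * h) for i, w in enumerate(leapfrog(h, n))]
err_ab = [w - exact(i * h) for i, w in enumerate(adams_bashforth2(h, n))]
print("max |error|, leapfrog:", max(abs(e) for e in err_lf))
print("max |error|, AB2     :", max(abs(e) for e in err_ab))
print("last leapfrog errors :", err_lf[-4:])
```

The last few leapfrog errors alternate in sign from one time step to the next, which is the saw-tooth pattern described above, while the Adams-Bashforth errors remain small over the whole interval.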
Figure 7.11 Comparison between the leapfrog method (top graph)
and the two-step Adams-Bashforth method (bottom graph) for the initial
value problem $x' + x = \sin t$, $x(0) = 1$ using a step size of $h = 0.1$.


EXERCISES 
1. Prove the second part of the stability and convergence theorem for one-step
methods.

2. Comment on the consistency, stability, and convergence of Heun’s method,
assuming the function $f(t, y)$ satisfies a Lipschitz condition in $y$ on the set
$D = \{(t,y) \mid a \le t \le b,\ y \in \mathbb{R}\}$ with Lipschitz constant $L$.

3. Repeat Exercise 2 for the optimal RK2 method.

4. Repeat Exercise 2 for the classical fourth-order Runge-Kutta method.

5. Repeat Exercise 2 for the Taylor method of order $p$. Here, assume that $f, f',
\ldots, f^{(p-1)}$ all satisfy a Lipschitz condition in $y$ on the set
$D = \{(t,y) \mid a \le t \le b,\ y \in \mathbb{R}\}$ with common Lipschitz constant $L$.


In Exercises 6-11, comment on the consistency, stability, and convergence of the
indicated linear multistep method.

6. $w_{i+1} - w_{i-1} = \frac{h}{3} \left[ f(t_{i+1}, w_{i+1}) + 4 f(t_i, w_i) + f(t_{i-1}, w_{i-1}) \right]$

7. $w_{i+1} - \frac{4}{3} w_i + \frac{1}{3} w_{i-1} = \frac{2}{3} h f(t_{i+1}, w_{i+1})$

8. $w_{i+1} - w_{i-2} = \frac{3h}{8} \left[ f(t_{i+1}, w_{i+1}) + 3 f(t_i, w_i) + 3 f(t_{i-1}, w_{i-1}) + f(t_{i-2}, w_{i-2}) \right]$


9. $w_{i+1} - \frac{1}{2} w_i - \frac{1}{2} w_{i-1} = h \left[ f(t_{i+1}, w_{i+1}) - \frac{1}{4} f(t_i, w_i) + \frac{3}{4} f(t_{i-1}, w_{i-1}) \right]$

10. $w_{i+1} - \frac{18}{11} w_i + \frac{9}{11} w_{i-1} - \frac{2}{11} w_{i-2} = \frac{6}{11} h f(t_{i+1}, w_{i+1})$

11. $w_{i+1} - 4 w_i + 3 w_{i-1} = -2h f(t_{i-1}, w_{i-1})$

12. Suppose that $\sum_{j=1}^{m} a_j = 1$. Show that $\sum_{j=1}^{m} a_j (m - j) + \sum_{j=0}^{m} b_j = m$ is
equivalent to $-\sum_{j=1}^{m} j a_j + \sum_{j=0}^{m} b_j = 0$.

13. Show that a convergent linear multistep method must satisfy the root condition 
by considering the initial value problem y’ = 0, y(0) = 0. 

14. Show that for a convergent linear multistep method $\sum_{j=1}^{m} j a_j \ne 0$. (Hint: Show
that the root condition and $\sum_{j=1}^{m} a_j = 1$ imply that $P'(1) \ne 0$, and then show
that this implies $\sum_{j=1}^{m} j a_j \ne 0$.)

15. Reconsider the solution of the initial value problem $x' + x = \sin t$, $x(0) = 1$
using the leapfrog method. Show that if the step size is reduced to $h = 0.01$, but
the interval is extended to $0 \le t \le 20$, sawtooth oscillations still appear in the
approximate solution.


16. Recall that Milne’s method is the linear multistep method given by 


$$\frac{w_{i+1} - w_{i-3}}{h} = \frac{4}{3} \left[ 2 f(t_i, w_i) - f(t_{i-1}, w_{i-1}) + 2 f(t_{i-2}, w_{i-2}) \right].$$

(a) Show that Milne’s method is weakly stable.

(b) Compare Milne’s method and the four-step Adams-Bashforth method for
solving the initial value problem $x' + x = \sin t$, $x(0) = 1$. Use $h = 0.1$ and
compute the approximate solutions over the interval $0 \le t \le 10$.

(c) Show that if the step size is reduced to $h = 0.01$, but the interval is extended
to $0 \le t \le 15$, sawtooth oscillations still appear in the approximate solution
obtained from Milne’s method.


7.7 ERROR CONTROL AND VARIABLE STEP SIZE ALGORITHMS 


The function whose graph is shown in Figure 7.12, $x(t) = t/(1+t^3)$, is the solution
of the initial value problem

$$\frac{dx}{dt} = -3tx^2 + \frac{1}{1+t^3}, \qquad x(0) = 0.$$

If we were to use one of the numerical methods developed earlier in the chapter to
solve this initial value problem, we would have to select a step size small enough to
resolve the rapid variation in the solution that occurs from $t = 0$ through roughly
$t = 2$. With that step size, we would then be doing much more work than necessary
to resolve the more slowly varying portions of the solution for $t > 2$.

This situation is reminiscent of the one encountered in Chapter 6 with numerical
integration. There, it was advantageous to develop algorithms which automatically
selected the quadrature points to control the error in the computed
approximation. In this section, we will discuss several different approaches for the
construction of variable step size initial value problem solvers, which adapt the step
size as the approximate solution evolves.

Figure 7.12 Solution of the initial value problem $x' = -3tx^2 + 1/(1+t^3)$,
$x(0) = 0$.


One-Step Methods 


Let’s start with one-step methods. Multistep methods will be considered toward 
the end of the section. The first component of any variable step size algorithm 
for the solution of initial value problems is a procedure for estimating the error in 
the approximate solution. Ideally, we want an estimate for the global error in the 
approximate solution; however, such estimates are very difficult to come by. We 
will therefore focus on controlling the local truncation error in the approximate 
solution. 

Let $w_i$ be an approximation to the true solution of the initial value problem

$$y'(t) = f(t,y), \qquad y(t_0) = y_0 \tag{1}$$

at $t = t_i$. Suppose that we want to advance the approximation to $t = t_{i+1} = t_i + h$
using the one-step method

$$w_{i+1} = w_i + h \phi(f, t_i, h, w_i), \tag{2}$$

which has local truncation error of order $\alpha_1$. We can estimate the local truncation
error that is introduced by this time step by comparing the value $w_{i+1}$ computed
above with the value generated by a different one-step method

$$\tilde{w}_{i+1} = w_i + h \tilde{\phi}(f, t_i, h, w_i),$$

which has local truncation error of order $\alpha_2$, where $\alpha_2 > \alpha_1$. The details of this
process are as follows.

Let $\hat{y}_{i+1}$ denote the exact value of the solution of equation (1) subject to the
initial condition $\hat{y}(t_i) = w_i$. By the definition of local truncation error,

$$\frac{\hat{y}_{i+1} - w_i}{h} - \phi(f, t_i, h, w_i) = k_1 h^{\alpha_1} + o(h^{\alpha_1}). \tag{3}$$

Combining the terms on the left-hand side of equation (3) and substituting from
equation (2) yields

$$\frac{\hat{y}_{i+1} - w_i}{h} - \phi(f, t_i, h, w_i) = \frac{\hat{y}_{i+1} - w_i - h \phi(f, t_i, h, w_i)}{h} = \frac{\hat{y}_{i+1} - w_{i+1}}{h}.$$

Therefore,

$$\frac{\hat{y}_{i+1} - w_{i+1}}{h} = k_1 h^{\alpha_1} + o(h^{\alpha_1}). \tag{4}$$

In a similar manner, we can establish that

$$\frac{\hat{y}_{i+1} - \tilde{w}_{i+1}}{h} = O(h^{\alpha_2}). \tag{5}$$

Subtracting equation (5) from equation (4) and using the fact that $O(h^{\alpha_2})$ is $o(h^{\alpha_1})$
since $\alpha_2 > \alpha_1$, it follows that

$$\frac{\tilde{w}_{i+1} - w_{i+1}}{h} = k_1 h^{\alpha_1} + o(h^{\alpha_1}).$$

Finally, assuming that the $o(h^{\alpha_1})$ term can be neglected, we see that

$$\epsilon = \frac{\tilde{w}_{i+1} - w_{i+1}}{h}$$

is an estimate for the local truncation error introduced by the time step taken with
the lower-order method.

Having a credible estimate for the local truncation error in $w_{i+1}$ available,
it seems reasonable to use this estimate to improve the accuracy of $w_{i+1}$. This
process, being similar to the extrapolation we applied to numerical differentiation
and integration formulas in Chapter 6, but here being based on a local error
estimate, is commonly known as local extrapolation. By substituting $\epsilon$ for $k_1 h^{\alpha_1}$ in (4)
and simplifying, we find $\hat{y}_{i+1} \approx \tilde{w}_{i+1}$; hence, local extrapolation is equivalent to
advancing the approximate solution using the higher-order method. Because local 
extrapolation increases accuracy at no additional computational expense, we will 
use it in the adaptive one-step methods we implement below. 
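
The estimate and the effect of local extrapolation can be illustrated with the simplest possible pair. The sketch below is only an illustration, not one of the pairs used later in this section: it assumes Euler's method as the lower-order method and Heun's method as the higher-order method, applied to the hypothetical test equation $y' = y$, $y(0) = 1$, and the function name is illustrative.

```python
import math


def euler_heun_estimate(f, t, w, h):
    """Return (w_low, w_high, eps) for one step of size h from (t, w)."""
    k1 = f(t, w)
    k2 = f(t + h, w + h * k1)
    w_low = w + h * k1                # Euler: local truncation error O(h)
    w_high = w + 0.5 * h * (k1 + k2)  # Heun: local truncation error O(h^2)
    eps = (w_high - w_low) / h        # estimate of the error in the Euler step
    return w_low, w_high, eps


w_low, w_high, eps = euler_heun_estimate(lambda t, y: y, 0.0, 1.0, 0.1)
true_lte = (math.exp(0.1) - w_low) / 0.1  # actual local truncation error of Euler
print(eps, true_lte)
```

Here the estimate is 0.05 and the actual value is roughly 0.0517. Accepting `w_high` rather than `w_low` is local extrapolation, and it requires no function evaluations beyond those already used to form the estimate.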

Let TOL denote the maximum allowable local truncation error. If $|\epsilon| >$ TOL,
we must reduce the step size and repeat the step. If $|\epsilon| <$ TOL, we may want to
increase the step size so as to reduce the overall workload. This leads to the second
component needed to construct a variable step size algorithm for the solution of
an initial value problem: a procedure for automatically adjusting the step size. To
determine an appropriate size for the next time step, recall that $\epsilon$ is an estimate
for the error introduced when the step size is $h$. Since we are working with an
underlying numerical method of order $\alpha_1$, if the next step is taken to be of size $qh$,
for some real number $q$, it follows that

$$\text{error with step of size } qh \;\approx\; q^{\alpha_1} \times (\text{error with step of size } h) \;\approx\; q^{\alpha_1} |\epsilon|.$$

We can therefore select $q$ to satisfy $q^{\alpha_1} |\epsilon| <$ TOL, or

$$q < \left( \frac{\text{TOL}}{|\epsilon|} \right)^{1/\alpha_1}.$$

Since several approximations have been made in arriving at this expression, we will
be conservative and choose $q$ to satisfy

$$q = \left( \frac{\text{TOL}}{2|\epsilon|} \right)^{1/\alpha_1}.$$

To prevent one very accurate time step from producing a large increase in the step
size, as well as to prevent one very inaccurate time step from producing a large
decrease in the step size, it is common to restrict the allowable range of $q$ values.
We will adopt the convention that $0.1 \le q \le 4.0$.
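
The step size rule amounts to only a few lines of code. The following is a minimal sketch; the function name and the treatment of a zero error estimate (return the largest allowed factor) are assumptions made for illustration.

```python
def step_factor(tol, eps, order, q_min=0.1, q_max=4.0):
    """Conservative step size adjustment factor, limited to [q_min, q_max]."""
    if eps == 0.0:
        return q_max  # assumption: a zero estimate allows the largest increase
    q = (tol / (2.0 * abs(eps))) ** (1.0 / order)
    return min(q_max, max(q_min, q))


print(step_factor(5e-7, 5e-7, 4))   # |eps| equal to TOL still shrinks the step
print(step_factor(5e-7, 1.0, 4))    # very inaccurate step: limited to 0.1
print(step_factor(1e-3, 1e-12, 1))  # very accurate step: limited to 4.0
```

Note that when $|\epsilon|$ equals TOL the factor is $2^{-1/4} \approx 0.84$ for a fourth-order method, so the conservative factor of 2 reduces the step size slightly even when the tolerance is met exactly.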

Any two methods of different order can be used to construct a variable step 
size algorithm according to the above process. For example, the classical fourth- 
order Runge-Kutta method can be used to estimate the local truncation error in- 
troduced by the third-order Taylor method. Using arbitrary methods, however, 
leads to a large overhead expense in terms of additional function evaluations. For 
the two methods listed in the previous sentence, the third-order Taylor method, 
by itself, requires three function evaluations per time step. Introducing the RK4 
method to estimate the truncation error requires three more function evaluations— 
the value f(t;, w;) can be reused. Hence, estimating the truncation error with these 
two methods doubles the computational effort per time step. To reduce the cost
of estimating the truncation error, we need methods that share function values.
Runge-Kutta methods are ideal for this situation. Given that there are many different
Runge-Kutta methods of each order (recall that there are an infinite number
of different second-order Runge-Kutta methods), it is possible to design special
pairs of methods, called embedded pairs, that share several function values. For 
details of this design process, see Butcher [1] or Iserles [2]. 

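A popular technique that falls into this category is the Runge-Kutta-Fehlberg
fourth-order/fifth-order scheme, which will hereafter be referred to as the RKF45
method. This algorithm uses the fifth-order method

$$\tilde{w}_{i+1} = w_i + \frac{16}{135} k_1 + \frac{6656}{12825} k_3 + \frac{28561}{56430} k_4 - \frac{9}{50} k_5 + \frac{2}{55} k_6$$

to estimate the local truncation error in one time step of the fourth-order method

$$w_{i+1} = w_i + \frac{25}{216} k_1 + \frac{1408}{2565} k_3 + \frac{2197}{4104} k_4 - \frac{1}{5} k_5,$$

where

$$\begin{aligned}
k_1 &= h f(t_i, w_i) \\
k_2 &= h f\left(t_i + \tfrac{1}{4} h,\; w_i + \tfrac{1}{4} k_1\right) \\
k_3 &= h f\left(t_i + \tfrac{3}{8} h,\; w_i + \tfrac{3}{32} k_1 + \tfrac{9}{32} k_2\right) \\
k_4 &= h f\left(t_i + \tfrac{12}{13} h,\; w_i + \tfrac{1932}{2197} k_1 - \tfrac{7200}{2197} k_2 + \tfrac{7296}{2197} k_3\right) \\
k_5 &= h f\left(t_i + h,\; w_i + \tfrac{439}{216} k_1 - 8 k_2 + \tfrac{3680}{513} k_3 - \tfrac{845}{4104} k_4\right) \\
k_6 &= h f\left(t_i + \tfrac{1}{2} h,\; w_i - \tfrac{8}{27} k_1 + 2 k_2 - \tfrac{3544}{2565} k_3 + \tfrac{1859}{4104} k_4 - \tfrac{11}{40} k_5\right).
\end{aligned}$$

Note this technique uses just six function evaluations per time step. (An arbitrary
combination of fourth- and fifth-order methods would require at least nine function
evaluations: four for the fourth-order method plus six for the fifth-order method
minus the reused value $f(t_i, w_i)$. See Butcher [1] for a discussion of why at least six
values are needed to achieve fifth-order accuracy.) The truncation error estimate
for the RKF45 method takes the form

$$\epsilon = \frac{\tilde{w}_{i+1} - w_{i+1}}{h} = \frac{1}{h} \left( \frac{1}{360} k_1 - \frac{128}{4275} k_3 - \frac{2197}{75240} k_4 + \frac{1}{50} k_5 + \frac{2}{55} k_6 \right).$$

Since the lower-order method in this pair is fourth order, the step size adjustment
factor is given by

$$q = \left( \frac{\text{TOL}}{2|\epsilon|} \right)^{1/4} \approx 0.84 \left( \frac{\text{TOL}}{|\epsilon|} \right)^{1/4}.$$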
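
The formulas above translate almost line for line into code. The following sketch assumes the standard Fehlberg coefficients and uses local extrapolation. The driver's details are assumptions made for illustration rather than a description of the program behind the results reported in this section: a step is accepted once $h$ has reached $h_{\min}$ even if the tolerance is not met, and the final step is shortened to land on the final time.

```python
def rkf45_step(f, t, w, h):
    """One trial step: returns (fifth-order value, truncation error estimate)."""
    k1 = h * f(t, w)
    k2 = h * f(t + h / 4, w + k1 / 4)
    k3 = h * f(t + 3 * h / 8, w + 3 * k1 / 32 + 9 * k2 / 32)
    k4 = h * f(t + 12 * h / 13,
               w + (1932 * k1 - 7200 * k2 + 7296 * k3) / 2197)
    k5 = h * f(t + h, w + 439 * k1 / 216 - 8 * k2
               + 3680 * k3 / 513 - 845 * k4 / 4104)
    k6 = h * f(t + h / 2, w - 8 * k1 / 27 + 2 * k2 - 3544 * k3 / 2565
               + 1859 * k4 / 4104 - 11 * k5 / 40)
    w5 = (w + 16 * k1 / 135 + 6656 * k3 / 12825 + 28561 * k4 / 56430
          - 9 * k5 / 50 + 2 * k6 / 55)
    eps = (k1 / 360 - 128 * k3 / 4275 - 2197 * k4 / 75240
           + k5 / 50 + 2 * k6 / 55) / h
    return w5, eps


def rkf45(f, t0, w0, t_end, h_min, h_max, tol):
    """Adaptive driver; returns accepted times, values, and evaluation count."""
    t, w, h = t0, w0, h_max
    ts, ws, evals = [t], [w], 0
    while t_end - t > 1e-12:
        h = min(h, t_end - t)  # do not step past the final time
        w5, eps = rkf45_step(f, t, w, h)
        evals += 6
        if abs(eps) <= tol or h <= h_min:  # assumption: forced accept at h_min
            t, w = t + h, w5               # local extrapolation
            ts.append(t)
            ws.append(w)
        q = 4.0 if eps == 0.0 else 0.84 * (tol / abs(eps)) ** 0.25
        h = min(h_max, max(h_min, min(4.0, max(0.1, q)) * h))
    return ts, ws, evals


def rhs(t, x):
    return -3 * t * x * x + 1 / (1 + t ** 3)


ts, ws, evals = rkf45(rhs, 0.0, 0.0, 5.0, 0.01, 0.25, 5e-7)
print(len(ts) - 1, "accepted steps,", evals, "function evaluations")
print("error at t = 5:", abs(ws[-1] - 5.0 / 126.0))
```

Because implementation details such as the handling of the final step vary, the step and evaluation counts printed by this sketch need not agree exactly with any particular implementation.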


Let’s apply the RKF45 method to the initial value problem posed at the
beginning of the section:

$$\frac{dx}{dt} = -3tx^2 + \frac{1}{1+t^3}, \qquad x(0) = 0.$$

We will approximate the solution over the range $0 \le t \le 5$. In addition to the
function on the right-hand side of the differential equation, the initial condition
and the final time, the algorithm requires three other parameters: the smallest
allowable step size, $h_{\min}$, the largest allowable step size, $h_{\max}$, and TOL. For this
problem, the values $h_{\min} = 0.01$, $h_{\max} = 0.25$, and TOL $= 5 \times 10^{-7}$ were used.
Results are summarized in Figure 7.13. The graph in the top left panel displays
the approximate solution, while the graph in the top right panel displays the global
error in the approximate solution. The graph in the bottom left panel illustrates the
evolution of the step size. Note how the step size automatically adjusts in response
to the variation in the solution curve, decreasing to resolve smaller scale features
and increasing to reduce computational effort in regions of larger scale features. In
total, the RKF45 method required 48 time steps to traverse from $t = 0$ to $t = 5$
and used 318 function evaluations.

Figure 7.13 (Top left) Approximate solution of $x' = -3tx^2 + 1/(1+t^3)$,
$x(0) = 0$ obtained using the RKF45 method. (Top right) Global
error in approximate solution. (Bottom left) Step size used by RKF45
method as a function of location.


EXAMPLE 7.24 Some of the RKF45 Calculations 

Let’s take a look at how some of the calculations that led to Figure 7.13 were carried
out. From $t_0 = w_0 = 0$, we attempt a step of length $h = h_{\max} = 0.25$. With these
values and $f(t,x) = -3tx^2 + 1/(1+t^3)$, we find

$k_1 = 0.250000000$, $k_2 = 0.249755874$, $k_3 = 0.249177100$,
$k_4 = 0.237901556$, $k_5 = 0.234571481$, $k_6 = 0.248061592$.

The resulting estimate for the local truncation error is

$$\epsilon = \frac{1}{h} \left( \frac{1}{360} k_1 - \frac{128}{4275} k_3 - \frac{2197}{75240} k_4 + \frac{1}{50} k_5 + \frac{2}{55} k_6 \right) = -4.5831 \times 10^{-6}.$$

Since $|\epsilon| >$ TOL $= 5 \times 10^{-7}$, we reject these calculations and try again with a
smaller step size. The step size adjustment factor is found to be

$$q = 0.84 \left( \frac{\text{TOL}}{|\epsilon|} \right)^{1/4} = 0.482759999,$$

so we choose $h = (0.482759999)(0.25) = 0.120690000$.

With $t_0 = w_0 = 0$ and this new value for $h$, we obtain

$k_1 = 0.120690000$, $k_2 = 0.120676739$, $k_3 = 0.120645252$,
$k_4 = 0.120023663$, $k_5 = 0.119842509$, $k_6 = 0.120584009$

and the truncation error estimate $\epsilon = -6.0064 \times 10^{-8}$. Now $|\epsilon| <$ TOL, so these
calculations are deemed sufficiently accurate. We advance the independent variable
to $t_1 = t_0 + h = 0.120690000$ and, taking advantage of local extrapolation, accept

$$\tilde{w}_1 = w_0 + \frac{16}{135} k_1 + \frac{6656}{12825} k_3 + \frac{28561}{56430} k_4 - \frac{9}{50} k_5 + \frac{2}{55} k_6 = 0.120478216$$

as an approximation to $x(t_1)$.

To prepare for the second time step, we compute a step size adjustment factor
of $q = 1.426818220$ and therefore select $h = (1.426818220)(0.120690000) =
0.172202690$. The next round of calculations then yields

$k_1 = 0.170995491$, $k_2 = 0.169196284$, $k_3 = 0.167870858$,
$k_4 = 0.157684015$, $k_5 = 0.155588578$, $k_6 = 0.166208019$

and $\epsilon = 1.2658 \times 10^{-7}$. This value is, in magnitude, smaller than TOL, so the
independent variable is advanced to $t_2 = t_1 + h = 0.292892690$ and $w_2 = 0.285713862$
is accepted as an approximation to $x(t_2)$.

The third time step is attempted with a step size of $h = 0.203923299$.
Unfortunately, this leads to a truncation error estimate of $\epsilon = 1.4976 \times 10^{-6}$, which is
too large. The third time step is then recomputed with $h = 0.130207704$. Now,
$\epsilon = 1.8269 \times 10^{-7} <$ TOL. The independent variable is advanced to $t_3 = 0.423100394$
and $w_3 = 0.393310681$ is accepted as an approximation to $x(t_3)$. This process is
continued until the independent variable has been advanced to $t = 5$.


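Another common embedded pair is the Runge-Kutta-Verner fifth-order/sixth-order
method (RKV56), which uses the sixth-order method

$$\tilde{w}_{i+1} = w_i + \frac{3}{40} k_1 + \frac{875}{2244} k_3 + \frac{23}{72} k_4 + \frac{264}{1955} k_5 + \frac{125}{11592} k_7 + \frac{43}{616} k_8$$

to estimate the local error in the fifth-order method

$$w_{i+1} = w_i + \frac{13}{160} k_1 + \frac{2375}{5984} k_3 + \frac{5}{16} k_4 + \frac{12}{85} k_5 + \frac{3}{44} k_6,$$

where

$$\begin{aligned}
k_1 &= h f(t_i, w_i) \\
k_2 &= h f\left(t_i + \tfrac{1}{6} h,\; w_i + \tfrac{1}{6} k_1\right) \\
k_3 &= h f\left(t_i + \tfrac{4}{15} h,\; w_i + \tfrac{4}{75} k_1 + \tfrac{16}{75} k_2\right) \\
k_4 &= h f\left(t_i + \tfrac{2}{3} h,\; w_i + \tfrac{5}{6} k_1 - \tfrac{8}{3} k_2 + \tfrac{5}{2} k_3\right) \\
k_5 &= h f\left(t_i + \tfrac{5}{6} h,\; w_i - \tfrac{165}{64} k_1 + \tfrac{55}{6} k_2 - \tfrac{425}{64} k_3 + \tfrac{85}{96} k_4\right) \\
k_6 &= h f\left(t_i + h,\; w_i + \tfrac{12}{5} k_1 - 8 k_2 + \tfrac{4015}{612} k_3 - \tfrac{11}{36} k_4 + \tfrac{88}{255} k_5\right) \\
k_7 &= h f\left(t_i + \tfrac{1}{15} h,\; w_i - \tfrac{8263}{15000} k_1 + \tfrac{124}{75} k_2 - \tfrac{643}{680} k_3 - \tfrac{81}{250} k_4 + \tfrac{2484}{10625} k_5\right) \\
k_8 &= h f\left(t_i + h,\; w_i + \tfrac{3501}{1720} k_1 - \tfrac{300}{43} k_2 + \tfrac{297275}{52632} k_3 - \tfrac{319}{2322} k_4 + \tfrac{24068}{84065} k_5 + \tfrac{3850}{26703} k_7\right).
\end{aligned}$$

This method uses just eight new function evaluations per time step, as compared to
the at least 12 evaluations that would be required of an arbitrary fifth-order,
sixth-order combination. The truncation error estimate and the step size adjustment
factor for the RKV56 method are, respectively,

$$\epsilon = \frac{1}{h} \left( -\frac{1}{160} k_1 - \frac{125}{17952} k_3 + \frac{1}{144} k_4 - \frac{12}{1955} k_5 - \frac{3}{44} k_6 + \frac{125}{11592} k_7 + \frac{43}{616} k_8 \right)$$

and

$$q = \left( \frac{\text{TOL}}{2|\epsilon|} \right)^{1/5} \approx 0.87 \left( \frac{\text{TOL}}{|\epsilon|} \right)^{1/5}.$$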

With the same parameter values as listed above ($h_{\min} = 0.01$, $h_{\max} = 0.25$
and TOL $= 5 \times 10^{-7}$), the RKV56 method was used to approximate the solution
of

$$\frac{dx}{dt} = -3tx^2 + \frac{1}{1+t^3}, \qquad x(0) = 0,$$

and the results are shown in Figure 7.14. This figure is structured similarly to
Figure 7.13. The top left graph shows the approximate solution, the top right the
global error, and the bottom left the evolution of the step size. Higher-order
component methods allow the RKV56 method to use larger step sizes than the RKF45
method. In total, the RKV56 method required 33 time steps and 288 function
evaluations for this problem.


Figure 7.14 (Top left) Approximate solution of $x' = -3tx^2 + 1/(1+t^3)$,
$x(0) = 0$ obtained using the RKV56 method. (Top right) Global
error in approximate solution. (Bottom left) Step size used by RKV56
method as a function of location.


Multistep Methods


Variable step size algorithms can also be constructed from multistep methods. Local
truncation error is still estimated by comparing approximations generated from two
different methods; however, when working with multistep methods, the convention
is to combine an explicit method with an implicit method of the same order, as
in predictor-corrector schemes. Let $w_{i+1}$ be the approximation produced by the
explicit method and $\tilde{w}_{i+1}$ the approximation from the implicit method. If each
method is of order $\alpha$ and $\hat{y}_{i+1}$ is the exact value of the solution of equation (1)
subject to the initial condition $\hat{y}(t_i) = w_i$ (and we make the assumption that all
other values used in the multistep equations are exact), then an analysis similar to
that which led to equations (4) and (5) yields

$$\tau_{i+1} = \frac{\hat{y}_{i+1} - w_{i+1}}{h} = k h^{\alpha} + o(h^{\alpha}) \tag{6}$$

and

$$\tilde{\tau}_{i+1} = \frac{\hat{y}_{i+1} - \tilde{w}_{i+1}}{h} = \tilde{k} h^{\alpha} + o(h^{\alpha}). \tag{7}$$

Assuming that the $o(h^{\alpha})$ terms can be neglected and then subtracting equation (7)
from equation (6) gives

$$\frac{\tilde{w}_{i+1} - w_{i+1}}{h} \approx (k - \tilde{k}) h^{\alpha}.$$

Solving this last expression for $h^{\alpha}$ and substituting into equation (7), we find that
the truncation error for the implicit method is approximately given by

$$\epsilon \approx \frac{\tilde{k}}{k - \tilde{k}} \cdot \frac{\tilde{w}_{i+1} - w_{i+1}}{h}.$$

This process for producing a truncation error estimate is known as the Milne device.
The formula for the step size adjustment factor,

$$q = \left( \frac{\text{TOL}}{2|\epsilon|} \right)^{1/\alpha},$$

is determined in the same manner as for one-step methods. The factor of 2 in the
denominator is included to be conservative.


As an example, suppose we construct a variable step size algorithm from the
four-step, fourth-order Adams-Bashforth method

$$\frac{w_{i+1} - w_i}{h} = \frac{1}{24} \left[ 55 f(t_i, w_i) - 59 f(t_{i-1}, w_{i-1}) + 37 f(t_{i-2}, w_{i-2}) - 9 f(t_{i-3}, w_{i-3}) \right]$$

and the three-step, fourth-order Adams-Moulton method

$$\frac{\tilde{w}_{i+1} - w_i}{h} = \frac{1}{24} \left[ 9 f(t_{i+1}, \tilde{w}_{i+1}) + 19 f(t_i, w_i) - 5 f(t_{i-1}, w_{i-1}) + f(t_{i-2}, w_{i-2}) \right].$$

The local truncation error for the Adams-Bashforth method is

$$\tau_{i+1} = \frac{251}{720} h^4 y^{(5)}(\xi),$$

where $t_{i-3} < \xi < t_{i+1}$. Since we can write

$$y^{(5)}(\xi) = y^{(5)}(t_i) + \left[ y^{(5)}(\xi) - y^{(5)}(t_i) \right]$$

and we know that $\xi \to t_i$ as $h \to 0$, it follows that

$$\tau_{i+1} = \frac{251}{720} h^4 y^{(5)}(t_i) + o(h^4).$$

In a similar manner, we can establish that the local truncation error for the
Adams-Moulton method is

$$\tilde{\tau}_{i+1} = -\frac{19}{720} h^4 y^{(5)}(t_i) + o(h^4).$$

Therefore, for this pair of methods, $k = \frac{251}{720} y^{(5)}(t_i)$ and $\tilde{k} = -\frac{19}{720} y^{(5)}(t_i)$, so that

$$\frac{\tilde{k}}{k - \tilde{k}} = \frac{-19/720}{251/720 + 19/720} = -\frac{19}{270}$$

and

$$\epsilon \approx -\frac{19}{270} \cdot \frac{\tilde{w}_{i+1} - w_{i+1}}{h}.$$

Furthermore, the step size adjustment factor is

$$q = \left( \frac{\text{TOL}}{2|\epsilon|} \right)^{1/4} \approx 0.84 \left( \frac{\text{TOL}}{|\epsilon|} \right)^{1/4},$$

since the component methods are both fourth order.
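
The Milne device reduces to a single ratio of error constants. The sketch below is only an illustration: the function name is hypothetical, and the sample predictor and corrector values are the ones quoted for the first attempted step in Example 7.25 ($w_4 = 0.475846091$, $\tilde{w}_4 = 0.504652183$, $h = 0.25$).

```python
from fractions import Fraction


def milne_estimate(w_explicit, w_implicit, h, k, k_tilde):
    """Truncation error estimate for the implicit method via the Milne device.

    k and k_tilde are the error constants of the explicit and implicit
    methods (the common derivative factor cancels in the ratio).
    """
    ratio = k_tilde / (k - k_tilde)
    return float(ratio) * (w_implicit - w_explicit) / h


# Error constants of the fourth-order Adams-Bashforth / Adams-Moulton pair
k_ab4, k_am3 = Fraction(251, 720), Fraction(-19, 720)

eps = milne_estimate(0.475846091, 0.504652183, 0.25, k_ab4, k_am3)
q = (5e-7 / (2 * abs(eps))) ** 0.25
print(eps, q)
```

With TOL $= 5 \times 10^{-7}$, the computed factor is roughly 0.0745, below the smallest value of $q$ we allow, so the step size would be reduced by the minimum factor of 0.1.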

When it comes to the implementation of a variable step size algorithm based
on multistep methods, there are a few more details to take into account than there
were with one-step methods. These are all related to the fact that multistep methods
require starting values at equally spaced points, so that any change in step size
necessitates computing new starting values. Therefore, when $|\epsilon| <$ TOL and $q > 1$,
we do not automatically increase the step size. We have to balance the expense of
the function evaluations needed to obtain new equally spaced starting values with
the savings attributed to taking fewer steps. For the fourth-order Adams methods,
experience suggests increasing the step size only when $|\epsilon| < 0.1 \cdot$ TOL.

A second consideration involves determining the amount of work that needs
to be redone when $|\epsilon| >$ TOL. If the steps that preceded the one that just failed
were used to compute new equally spaced starting values, then those starting values
must be rejected along with the current step. We must return to the last accepted
values and start over. On the other hand, if the preceding steps were ones computed
with the multistep methods and accepted, we simply need to adjust the step size,
compute new starting values, and continue. No prior results need to be rejected.

The last issue to address is the final time steps, those that advance the
approximate solution to the final solution time, $t_f$. Generally, it will be necessary to
adjust the step size so that the solution time exactly reaches $t_f$. Unfortunately, this
means that several time steps will be needed to finish the solution process: those
associated with computing new equally spaced starting values plus one step with
the multistep methods. Therefore, the step size needs to be set not to $t_f$ minus
the current time, but to some fraction of this value, with the fraction determined
by the number of starting values that need to be computed. For the fourth-order
Adams methods, three new starting values must be computed, so we would set the
step size to one-fourth the difference between $t_f$ and the current time.

With the same parameter values applied earlier ($h_{\min} = 0.01$, $h_{\max} = 0.25$,
and TOL $= 5 \times 10^{-7}$), the approximate solution of the equation posed at the
beginning of the section was computed using a variable step size Adams fourth-order
predictor-corrector (VS_PC4) method, and the results are shown in Figure 7.15.
The top left graph shows the approximate solution, the top right the global error,
and the bottom left the evolution of the step size. Note that the step sizes used by
this method are significantly smaller than those used by either the RKF45 or the
RKV56 method. This translates into about 2.8 times and 4 times as many time steps (133
versus 48 and 33) as the RKF45 method and the RKV56 method, respectively. In
terms of the number of function evaluations, which is a better measure of
computational effort, the difference between the three methods is less pronounced. The
current algorithm required 370 evaluations, just 16% more than RKF45 and 28%
more than RKV56.

Figure 7.15 (Top left) Approximate solution of $x' = -3tx^2 + 1/(1+t^3)$,
$x(0) = 0$ obtained using the VS_PC4 method. (Top right) Global
error in approximate solution. (Bottom left) Step size used by VS_PC4
method as a function of location.


EXAMPLE 7.25 Some of the VS_PC4 Calculations 


As we did earlier with the RKF45 method, we will now examine some of the
calculations which went into the construction of Figure 7.15. From $t_0 = w_0 = 0$, we
attempt a step of length $h = h_{\max} = 0.25$. To initialize the multistep methods, we
use the classical fourth-order Runge-Kutta (RK4) method to calculate

$w_1 = 0.246142027$, $w_2 = 0.444346535$, and $w_3 = 0.527057387$.

The Adams fourth-order predictor-corrector method then yields

$w_4 = 0.475846091$ and $\tilde{w}_4 = 0.504652183$.

The resulting estimate for the local truncation error has magnitude

$$|\epsilon| \approx \frac{19}{270} \cdot \frac{|\tilde{w}_4 - w_4|}{h} = 8.1084 \times 10^{-3}.$$

Since $|\epsilon| >$ TOL $= 5 \times 10^{-7}$, we discard the work we’ve just done and start over
with a smaller step size. The step size adjustment factor is found to be

$$q = \left( \frac{\text{TOL}}{2|\epsilon|} \right)^{1/4} = 0.074516330,$$

but the smallest value we allow for $q$ is 0.1. We will therefore reattempt the first
time step with $h = (0.1)(0.25) = 0.025$.

With $t_0 = w_0 = 0$ and $h = 0.025$, we obtain

$w_1 = 0.024999609$, $w_2 = 0.049993751$, and $w_3 = 0.074968373$

from the RK4 method, followed by $w_4 = 0.099900078$ and $\tilde{w}_4 = 0.099900103$
from the predictor-corrector method. The estimate for the local truncation
error is $\epsilon = 7.0200 \times 10^{-8}$. Since $|\epsilon| <$ TOL, we accept $w_1$, $w_2$, $w_3$, and $\tilde{w}_4$ as
approximations to $x(0.025)$, $x(0.05)$, $x(0.075)$, and $x(0.1)$, respectively, and
advance the independent variable to $t_4 = t_0 + 4h = 0.1$. Though we’ve obtained four
values for the approximate solution, since only one of these was obtained from the
predictor-corrector method, this is considered to be one time step of the VS_PC4
method.

For the second time step we continue to use $h = 0.025$ because $|\epsilon|$ was not
smaller than $0.1 \cdot$ TOL. Working only with the predictor-corrector method, we
find $w_5 = 0.124756290$, $\tilde{w}_5 = 0.124756344$, and $\epsilon = 1.5064 \times 10^{-7}$. This local
truncation error estimate is below TOL, so we advance the independent variable to
$t_5 = 0.125$ and accept $x(t_5) \approx \tilde{w}_5$. The third and fourth time steps, both computed
with $h = 0.025$, also produce sufficiently accurate results. In particular, we find
$x(0.15) \approx \tilde{w}_6 = 0.149495471$ and $x(0.175) \approx \tilde{w}_7 = 0.174067140$.

Attempting another step of length $h = 0.025$, however, produces $\epsilon = 5.6576 \times
10^{-7}$, which is too large. The step size adjustment factor is $q = 0.815316647$, so
we reduce the step size to $h = (0.815316647)(0.025) = 0.020382916$. The RK4
method is used three times to initialize the predictor-corrector method, which is
then used to produce a truncation error estimate. With the smaller step size, we
find $\epsilon = 4.7438 \times 10^{-7} <$ TOL. The accepted approximations are

$x(0.195382916) \approx 0.193936444$;
$x(0.215765832) \approx 0.213620059$;
$x(0.236148748) \approx 0.233079328$; and
$x(0.256531664) \approx 0.252272845$.

The independent variable is advanced to 0.256531664, and we continue until we
reach $t = 5$.
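
A single trial step of this kind can be sketched in a few lines. The function names below are illustrative, and the sketch makes two assumptions that may differ from a full VS_PC4 implementation: the corrector is applied once, with $f(t_{i+1}, \cdot)$ evaluated at the predicted value, and the truncation error estimate is formed with the factor $-19/270$ obtained from the Milne device.

```python
def f(t, x):
    return -3.0 * t * x * x + 1.0 / (1.0 + t ** 3)


def rk4_step(f, t, w, h):
    k1 = f(t, w)
    k2 = f(t + h / 2, w + h * k1 / 2)
    k3 = f(t + h / 2, w + h * k2 / 2)
    k4 = f(t + h, w + h * k3)
    return w + h * (k1 + 2 * k2 + 2 * k3 + k4) / 6


def pc4_trial(f, t0, w0, h):
    """Three RK4 start-up steps, then one predictor-corrector step.

    Returns ([w_1, w_2, w_3], predicted w_4, corrected w_4, error estimate).
    """
    t = [t0 + i * h for i in range(5)]
    w = [w0]
    for i in range(3):
        w.append(rk4_step(f, t[i], w[i], h))
    fv = [f(t[i], w[i]) for i in range(4)]
    # four-step Adams-Bashforth predictor
    pred = w[3] + h * (55 * fv[3] - 59 * fv[2] + 37 * fv[1] - 9 * fv[0]) / 24
    # three-step Adams-Moulton corrector, evaluated at the predicted value
    corr = w[3] + h * (9 * f(t[4], pred) + 19 * fv[3]
                       - 5 * fv[2] + fv[1]) / 24
    eps = -19.0 / 270.0 * (corr - pred) / h  # Milne device
    return w[1:], pred, corr, eps


for h in (0.25, 0.025):
    start, pred, corr, eps = pc4_trial(f, 0.0, 0.0, h)
    print(h, start, pred, corr, abs(eps))
```

The two calls correspond to the rejected attempt with $h = 0.25$ and the accepted attempt with $h = 0.025$; the sign of the estimate depends on the convention adopted, so only its magnitude is compared with TOL.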


The results of this one test problem should not be taken to imply that the 
RKV56 method is always more efficient than the RKF45 method, which is always 
more efficient than the VS_PC4 method. There are problems for which the RKF45 
method will be the most efficient, other problems for which the VS_PC4 method
will be the most efficient, and still others for which all three algorithms will perform
roughly the same. In fact, for the test problem we've been examining in this section, 
if we were to advance the independent variable to t = 10, we would find that the 
VS_PC4 method was the most efficient in terms of required function evaluations, 
followed by the RKF45 method and the RKV56 method, in that order. 
EXERCISES


1. Perform three steps of the RKF45 method for each of the following initial value
problems. Take $h_{\min} = 0.0001$, $h_{\max} = 0.25$ and TOL $= 5 \times 10^{-7}$.
(a) 2 =1l+2/t, x)=1 
(b) $x' = t/x$, $x(0) = 1$
(c) 2 =2x(1—2), 2x(0) = 0.25 
(d) 2 = 1.094+22—2°, 2(0) = —-1.5 
(e) a! =—227 +1/(1+4*), 2(0) =0 
2. Repeat Exercise 1 with the RKV56 method.
3. Repeat Exercise 1 with the VS_PC4 method, but change $h_{\max}$ to 0.15.


4. Suppose we were to construct a variable step size algorithm from the following
embedded pair: The third-order method

$$\tilde{w}_{i+1} = w_i + \frac{1}{4} k_1 + \frac{3}{8} k_2 + \frac{3}{8} k_3$$

is used to approximate the local truncation error in the second-order method

$$w_{i+1} = w_i + \frac{1}{4} k_1 + \frac{3}{4} k_2,$$

where

$$\begin{aligned}
k_1 &= h f(t_i, w_i) \\
k_2 &= h f\left(t_i + \tfrac{2}{3} h,\; w_i + \tfrac{2}{3} k_1\right) \\
k_3 &= h f\left(t_i + \tfrac{2}{3} h,\; w_i + \tfrac{2}{3} k_2\right).
\end{aligned}$$

(a) In terms of $k_1$, $k_2$ and $k_3$, what is the formula for the local truncation error
estimate?

(b) What is the formula for the step size adjustment factor $q$?

5. Repeat Exercise 1 with the variable step size algorithm described in Exercise 4,
but change TOL to $5 \times 10^{-4}$.


6. Suppose we were to construct a variable step size algorithm from the two-step
Adams-Bashforth method and the one-step Adams-Moulton method (the trapezoidal
method).

(a) Using the Milne device, what formula would we use to estimate the local
truncation error in the one-step Adams-Moulton method?

(b) What would be the formula for the step size adjustment factor $q$?

(c) What would be an appropriate choice for the method to initialize the
predictor-corrector?


7. Repeat Exercise 1 with the variable step size algorithm described in Exercise 6,
but change TOL to $5 \times 10^{-4}$ and change $h_{\max}$ to 0.15.

8. Suppose we were to construct a variable step size algorithm from the three-step
Adams-Bashforth method and the two-step Adams-Moulton method.


(a) Using the Milne device, what formula would we use to estimate the local 
truncation error in the two-step Adams-Moulton method? 
(b) What would be the formula for the step size adjustment factor q? 


(c) What would be an appropriate choice for the method to initialize the pre- 
dictor-corrector? 


In Exercises 9-11, use the RKF45 method to approximate the solution of the given initial
value problem using the indicated parameter values. Plot the approximate solution and
the step size selected by the algorithm as a function of the independent variable. Note
any unusual behavior in the step size.

9. $x' = 100(1 - x)$ $(0 \le t \le 2)$, $x(0) = 0$, $h_{\min} = 0.0001$, $h_{\max} = 0.25$,
TOL $= 5 \times 10^{-7}$

10. $x' = -50(x - \cos t)$ $(0 \le t \le 10)$, $x(0) = 1$, $h_{\min} = 0.001$, $h_{\max} = 0.2$,
TOL $= 10^{-4}$


ae 1432-2) (0<t<4), 2(0)=1, hmin = 0.0001, hmex = 0.25, 
TOL=5x107? 


In Exercises 12-18, compare the performance (as measured by the number of function
evaluations) of the RKF45 method, the RKV56 method, and the VS_PC4 method for
the given initial value problem. Use the specified parameter values for each method.


12. a =e" (0<t<0.9), 2(0)=0, Amin =0.001, hmax = 0.25, TOL= 10-§ 
13. $x' = x - t^2 + 1$ $(0 \le t \le 2)$, $x(0) = 0.5$, $h_{\min} = 0.01$, $h_{\max} = 0.5$,
TOL $= 5 \times 10^{-6}$
14, g! = —3tx? + 1/(t +3) (0<#<10), 2(0)=0, Amin =0.01, hmox = 0.25 
TOLLS S510" 


15. 2 =1.094+2¢—2° (0<t< 60), 2(0)=-3, hmin=0.0001, Amax = 0.5 
TOL =5x 10° 
16. The “Spread of an Epidemic” Problem from Section 7.2


a =m[N -—D—Hpexp(-cD/m)| (0<t<15), D(0)=0 


m = 1.8 week™', c= 0.001 (person - week)? 
N = 3000 people, Ho = 2850 people 
Amin = 0.1 weeks, fimax = 2 weeks, TOL =5x 107° 
17. The “Contrast Enhanced Lithography” Problem from Section 7.5 


dM 
<< = Mc ae — M2) + B(1— Mz) — Beln Me| (0<2<04) 


M-(0) = exp(—ICct) 
a=14A,, B=—0.06Ac, Ae = 8.93/um 
Be =0.175/m, Ce = 0.0376 cm? /mJ, 
It=110 mJ/em?, Amin =0.00lpm, Amax = 0.05um, TOL = 107° 


18. The “Genetic Switch” Problem from Section 7.4 


dg 9 
29 _ 9.206 +3. ef St< = 
7 +3.087T 5 —15lg (0S +S 100), 9(0) =0 


Amin = 0.1, Amax oa 5, TOL Ele i078 


7.8 SYSTEMS OF EQUATIONS AND HIGHER-ORDER EQUATIONS 
Systems of Equations 


Many of the phenomena that interest scientists give rise not to a scalar initial value 
problem but rather to a system of initial value problems. To study the dynamics 
of interacting populations, (at least) one equation is needed for each of the relevant 
populations. Even in the simplest models of two populations interacting either in 
a predator-prey or a competitive relationship, a system of equations of the form 


$$\frac{dx_1}{dt} = a_1 x_1 - b_1 x_1^2 + c_1 x_1 x_2$$
$$\frac{dx_2}{dt} = a_2 x_2 - b_2 x_2^2 + c_2 x_1 x_2$$


must be studied. The system of three equations

$$\frac{dL}{dt} = -3.6L + 1.2\left[P(1 - P^2) - 1\right]$$
$$\frac{dP}{dt} = -1.2P + 6\left(L + \frac{2}{1 + Z}\right)$$
$$\frac{dZ}{dt} = -0.12Z + 12P$$

has been used to model the emotional and inspirational cycle of a fourteenth-century
Italian poet (see Exercise 16). Larger systems are also easy to find. For example,
the study of combustion dynamics can give rise to extremely large systems. In
addition to equations for the relevant thermodynamic properties, there could be 
hundreds, or even thousands, of species involved in the chemical reactions taking 
place, and each species requires its own differential equation. 

Those who have studied systems of differential equations know that the anal- 
ysis of systems generally requires more advanced mathematical techniques than 
the analysis of scalar equations. Systems of equations also exhibit a wider variety 
of possible behaviors than do scalar equations. Fortunately, however, as we will 
see below, the numerical analysis of systems requires little more than a notational 
change from the scalar case. The derivations of the methods will be identical, and 
the advantages and disadvantages of the various methods will be unchanged. 

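The most general system of m first-order initial value problems can be written
in the form

$$\begin{aligned}
u_1'(t) &= f_1(t, u_1, u_2, u_3, \ldots, u_m), & u_1(a) &= \alpha_1 \\
u_2'(t) &= f_2(t, u_1, u_2, u_3, \ldots, u_m), & u_2(a) &= \alpha_2 \\
u_3'(t) &= f_3(t, u_1, u_2, u_3, \ldots, u_m), & u_3(a) &= \alpha_3 \\
&\;\;\vdots \\
u_m'(t) &= f_m(t, u_1, u_2, u_3, \ldots, u_m), & u_m(a) &= \alpha_m,
\end{aligned}$$

where $t$ is the independent variable (we will once again assume that we are interested
in the range $a \le t \le b$) and $u_1, u_2, u_3, \ldots, u_m$ are the dependent variables. This
system is most efficiently represented using vector notation. Toward that end, let

$$\mathbf{u}(t) = \begin{bmatrix} u_1(t) \\ u_2(t) \\ u_3(t) \\ \vdots \\ u_m(t) \end{bmatrix}, \qquad
\boldsymbol{\alpha} = \begin{bmatrix} \alpha_1 \\ \alpha_2 \\ \alpha_3 \\ \vdots \\ \alpha_m \end{bmatrix}$$

and

$$\mathbf{f}(t, \mathbf{u}) = \begin{bmatrix}
f_1(t, u_1, u_2, u_3, \ldots, u_m) \\
f_2(t, u_1, u_2, u_3, \ldots, u_m) \\
f_3(t, u_1, u_2, u_3, \ldots, u_m) \\
\vdots \\
f_m(t, u_1, u_2, u_3, \ldots, u_m)
\end{bmatrix}.$$

Then

$$\mathbf{u}'(t) = \begin{bmatrix} u_1'(t) \\ u_2'(t) \\ u_3'(t) \\ \vdots \\ u_m'(t) \end{bmatrix},$$

and the original system of equations can be written very compactly as

$$\mathbf{u}'(t) = \mathbf{f}(t, \mathbf{u}), \qquad a \le t \le b.$$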


Once we have recognized that a system of first-order differential equations 
can be written in a form similar to that of a scalar first-order equation, but with 
scalar notation replaced by vector notation, we see that all of the key concepts 
and definitions (such as local truncation error and global discretization error) and 
all of the algorithms we have spent so much time developing carry over directly 
from the scalar case. We simply have to replace all of the scalar notation with the 
appropriate vector notation. 

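For example, Euler's method takes the form

$$w_0 = \alpha$$
$$\frac{w_{i+1} - w_i}{h} = f(t_i, w_i)$$

in the scalar case; whereas, for a system of equations, Euler's method takes the
form

$$\mathbf{w}^{(0)} = \boldsymbol{\alpha}$$
$$\frac{\mathbf{w}^{(i+1)} - \mathbf{w}^{(i)}}{h} = \mathbf{f}(t_i, \mathbf{w}^{(i)}).$$

Note that in the vector notation, we have used a superscript to denote the time step
so that subscripts can be used in the conventional manner to denote components
within a vector. As a second example, the two-step Adams-Bashforth method for
a scalar equation is

$$\frac{w_{i+1} - w_i}{h} = \frac{3}{2} f(t_i, w_i) - \frac{1}{2} f(t_{i-1}, w_{i-1}).$$

For a system of equations, the method equation reads

$$\frac{\mathbf{w}^{(i+1)} - \mathbf{w}^{(i)}}{h} = \frac{3}{2} \mathbf{f}(t_i, \mathbf{w}^{(i)}) - \frac{1}{2} \mathbf{f}(t_{i-1}, \mathbf{w}^{(i-1)});$$

that is, the method retains the exact same functional form but with vector quantities
substituted for the appropriate scalar values.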

Although the concepts, definitions, and derivations of the various numerical 
methods are unchanged when dealing with systems of equations, there are cer- 
tain programming considerations that must be handled. First, the user-supplied 
function that computes the values of f must take an array of input values for the de- 
pendent variables, in addition to the value of the independent variable, and return 
an array of values. Second, the code must contain a loop to update each component 
of the w vector, not just a single value. Next, for those methods that make use of
intermediate values, such as the RK4 method, or that save previous function values
for efficiency reasons, such as the Adams-Bashforth and Adams-Moulton methods,
intermediate storage may be needed. This issue is, of course, heavily dependent
on the choice of programming language. For adaptive schemes, like the RKF45
method, a norm must be used to compute the error estimate. Finally, Newton’s 
method for systems of nonlinear equations will be needed for implicit schemes like 
the trapezoidal method. These are the only changes needed to translate the code 
for a single equation to handle a system of equations. 
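
The programming considerations listed above can be sketched in a few lines. The following is a minimal illustration in plain Python, not code from the text: the names `euler_system`, `rk4_system`, `infinity_norm`, and `oscillator` are hypothetical, and the test system $u_1' = u_2$, $u_2' = -u_1$ is chosen only because its exact solution is known.

```python
import math


def euler_system(f, a, b, alpha, n):
    """Euler's method for u' = f(t, u), u(a) = alpha, with n uniform steps."""
    h = (b - a) / n
    t, w = a, list(alpha)
    history = [(t, list(w))]
    for i in range(n):
        slopes = f(t, w)  # f takes an array of values and returns an array
        w = [wj + h * sj for wj, sj in zip(w, slopes)]  # update every component
        t = a + (i + 1) * h
        history.append((t, list(w)))
    return history


def rk4_system(f, a, b, alpha, n):
    """Classical RK4 for a system; k1..k4 are the intermediate arrays."""
    h = (b - a) / n
    t, w = a, list(alpha)
    history = [(t, list(w))]
    for i in range(n):
        k1 = f(t, w)
        k2 = f(t + h / 2, [wj + h / 2 * kj for wj, kj in zip(w, k1)])
        k3 = f(t + h / 2, [wj + h / 2 * kj for wj, kj in zip(w, k2)])
        k4 = f(t + h, [wj + h * kj for wj, kj in zip(w, k3)])
        w = [wj + h / 6 * (p + 2 * q + 2 * r + s)
             for wj, p, q, r, s in zip(w, k1, k2, k3, k4)]
        t = a + (i + 1) * h
        history.append((t, list(w)))
    return history


def infinity_norm(v):
    """Norm that an adaptive scheme could apply to a vector error estimate."""
    return max(abs(vj) for vj in v)


def oscillator(t, u):
    """Illustrative system u1' = u2, u2' = -u1 (exact solution: cos t, -sin t)."""
    return [u[1], -u[0]]


# Measure the RK4 error at t = 1 against the exact solution.
t_end, w_end = rk4_system(oscillator, 0.0, 1.0, [1.0, 0.0], 100)[-1]
rk4_error = infinity_norm([w_end[0] - math.cos(1.0), w_end[1] + math.sin(1.0)])
```

Apart from the list comprehensions that replace scalar arithmetic, each routine has the same structure as its scalar counterpart, which is the point made in the paragraph above.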


Worked Examples: Systems of Equations 


EXAMPLE 7.26 Cooling of a Container and its Liquid Contents 


Following an experiment, a small container and its liquid contents are at a tem- 
perature of 150°F. To cool both the liquid and the container to room temperature 
(70°F), the container is immersed in a bath held at 32°F. To model the cooling of 
both the container and the liquid, let L denote the temperature of the liquid and C 
denote the temperature of the container. Balancing the rate of change of energy 
storage in the liquid and the container with the rate of convective heat transfer 
(between the liquid and the container and between the container and the bath) 
leads to the coupled system of differential equations: 


$$\frac{dL}{dt} = \frac{A_1 h}{\rho_1 c_{p,1} v_1}(C - L)$$
$$\frac{dC}{dt} = \frac{A_2 h}{\rho_2 c_{p,2} v_2}(32 - C) + \frac{A_1 h}{\rho_2 c_{p,2} v_2}(L - C).$$


The parameter values for this system are summarized in the table below. 


                 Liquid                              Container
Mass density     $\rho_1 = 62$ lb$_m$/ft$^3$         $\rho_2 = 139$ lb$_m$/ft$^3$
Specific heat    $c_{p,1} = 1.00$ Btu/lb$_m$·°F      $c_{p,2} = 0.2$ Btu/lb$_m$·°F
Volume           $v_1 = 0.08$ ft$^3$                 $v_2 = 0.003$ ft$^3$


The convective heat transfer coefficient, $h$, is assumed to be 8.8 Btu/hr·ft$^2$·°F,
and the surface contact areas are $A_1 = 0.4$ ft$^2$ and $A_2 = 0.5$ ft$^2$.

Anticipating a rapid initial temperature drop for the container, this model 
was simulated using the RKF45 method with a minimum step size of 0.0001, a 
maximum step size of 0.2, an error tolerance of 0.0001 per unit step, and the infinity 
norm to measure the error estimates. The resulting temperature profiles are shown 
in the graph in Figure 7.16. The container reaches room temperature in roughly
15 minutes, but it takes slightly more than an hour for the liquid temperature to
reach 70°F.


Figure 7.16 Temperature profiles for Example 7.26.


EXAMPLE 7.27 A Catalyzed Reaction 


In the Overview to this chapter (see “A Catalyzed Reaction,” page 535), we devel- 
oped a model for a catalyzed chemical reaction between two reactants, which we 
called A and B. The model originally consisted of five equations, which we reduced 
to the system 


$$\alpha' = -\alpha(\lambda - \beta + \alpha), \qquad \alpha(0) = 1$$
$$\beta' = -\kappa\beta(\beta - \alpha), \qquad \beta(0) = 1.$$


Here, $\alpha$ and $\beta$ denote the normalized concentrations of A and B, respectively, $\lambda$ is
the ratio of the initial concentration of the catalyst to the initial concentration of A,
and $\kappa$ is the ratio of the reaction rates of the two pathways by which the reaction
takes place.

Let's fix $\kappa = 1$ and examine the effect of $\lambda$ on the dynamics of the reaction.
We will investigate the effect of $\kappa$ on reaction dynamics in the exercises. For $\lambda = 10$,
1, 0.1, and 0.01, the model equations were solved using the RKF45 method with
parameter values $h_{\min} = 0.001$, $h_{\max} = 0.5$ and TOL = $5 \times 10^{-6}$. The independent
variable was advanced from $t = 0$ to $t = 100$, 150, 200, and 1000 for the respective
values of $\lambda$. Since the presence of the catalyst is supposed to speed up the reaction,
it should not be surprising that larger catalyst concentrations (larger values of $\lambda$)
cause the reaction to reach equilibrium more rapidly. The results of the calculations
are shown below in Figure 7.17.

Apparently, the value of $\lambda$ influences the relative rates at which the reactants A
and B are depleted during the reaction. When $\lambda$ is “small,” the solution trajectories
remain near the $\alpha = \beta$ line of the phase plane, indicating that the two reactants
are being depleted at roughly the same rate. For larger $\lambda$, however, we observe
that reactant A is consumed more rapidly than reactant B and that the disparity
between the depletion rates grows with increasing $\lambda$.

Figure 7.17 Influence of $\lambda$ on the dynamics of a catalyzed reaction.
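
The qualitative claims in this example can be checked with a short experiment. The sketch below is an illustration only. It assumes the reduced model reads $\alpha' = -\alpha(\lambda - \beta + \alpha)$, $\beta' = -\kappa\beta(\beta - \alpha)$ with $\alpha(0) = \beta(0) = 1$, and it substitutes a fixed-step classical RK4 integrator for the RKF45 method used in the text. The function names and the choices $h = 0.01$, $t_{end} = 5$ are hypothetical.

```python
def catalyzed_rhs(lam, kappa):
    """Right-hand side of the (assumed) reduced catalyzed-reaction model."""
    def f(t, u):
        alpha, beta = u
        return [-alpha * (lam - beta + alpha), -kappa * beta * (beta - alpha)]
    return f


def rk4_step(f, t, w, h):
    """One classical RK4 step for a system stored as a list."""
    k1 = f(t, w)
    k2 = f(t + h / 2, [wj + h / 2 * kj for wj, kj in zip(w, k1)])
    k3 = f(t + h / 2, [wj + h / 2 * kj for wj, kj in zip(w, k2)])
    k4 = f(t + h, [wj + h * kj for wj, kj in zip(w, k3)])
    return [wj + h / 6 * (p + 2 * q + 2 * r + s)
            for wj, p, q, r, s in zip(w, k1, k2, k3, k4)]


def simulate_reaction(lam, kappa=1.0, t_end=5.0, h=0.01):
    """Advance (alpha, beta) from (1, 1) to t_end; returns the final state."""
    f = catalyzed_rhs(lam, kappa)
    w = [1.0, 1.0]
    for i in range(int(round(t_end / h))):
        w = rk4_step(f, i * h, w, h)
    return w


# Final (alpha, beta) pairs for three catalyst ratios.
final_states = {lam: simulate_reaction(lam) for lam in (10.0, 1.0, 0.1)}
```

In each run $\alpha$ ends below $\beta$, and the remaining amount of A is much smaller for $\lambda = 10$ than for $\lambda = 0.1$, in line with the observation that A is consumed more rapidly as $\lambda$ grows.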


Higher-Order Equations 


Consider the mth-order initial value problem 


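$$y^{(m)} = f(t, y, y', y'', \ldots, y^{(m-1)})$$
$$y(a) = \alpha_1, \quad y'(a) = \alpha_2, \quad y''(a) = \alpha_3, \quad \ldots, \quad y^{(m-1)}(a) = \alpha_m.$$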


Any mth-order initial value problem can always be reduced to a system of m first- 
order equations by the introduction of an appropriate set of variables. Once this 
conversion has been performed, any of the numerical techniques that we have dis- 
cussed in this chapter and have extended to systems of equations earlier in this 
section can be used. 

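The reduction to a system of equations begins with the definition of the
intermediate variables

$$u_1(t) = y(t), \quad u_2(t) = y'(t), \quad u_3(t) = y''(t), \quad \ldots, \quad u_m(t) = y^{(m-1)}(t).$$

These definitions, in turn, imply

$$u_1'(t) = u_2(t), \quad u_2'(t) = u_3(t), \quad \ldots, \quad u_{m-1}'(t) = u_m(t),$$
$$u_m'(t) = y^{(m)}(t) = f(t, u_1, u_2, u_3, \ldots, u_m),$$

together with the associated initial conditions

$$u_1(a) = \alpha_1, \quad u_2(a) = \alpha_2, \quad u_3(a) = \alpha_3, \quad \ldots, \quad u_m(a) = \alpha_m.$$

In vector form, the original mth-order equation can now be written as the system
of m first-order equations

$$\mathbf{u}'(t) = \mathbf{f}(t, \mathbf{u}(t)),$$

where

$$\mathbf{u}(t) = \begin{bmatrix} u_1(t) \\ u_2(t) \\ \vdots \\ u_{m-1}(t) \\ u_m(t) \end{bmatrix}
\quad \text{and} \quad
\mathbf{f}(t, \mathbf{u}) = \begin{bmatrix} u_2(t) \\ u_3(t) \\ \vdots \\ u_m(t) \\ f(t, \mathbf{u}) \end{bmatrix}.$$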
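
The reduction is mechanical enough to automate. As a minimal sketch (the helper name `as_first_order_system` and the sample third-order equation are hypothetical, not taken from the text), the vector right-hand side can be built from the scalar function $f$ as follows:

```python
def as_first_order_system(f, m):
    """Build F(t, u) for u = [y, y', ..., y^(m-1)] from y^(m) = f(t, y, ..., y^(m-1))."""
    def F(t, u):
        if len(u) != m:
            raise ValueError("state vector must have m components")
        # The first m-1 derivatives are shifted copies of u; the last comes from f.
        return list(u[1:]) + [f(t, *u)]
    return F


def sample_f(t, y, yp, ypp):
    """Illustrative third-order equation y''' = t - 2y + 3y' - y''."""
    return t - 2.0 * y + 3.0 * yp - ypp


F = as_first_order_system(sample_f, 3)
derivatives = F(1.0, [1.0, 2.0, 3.0])
```

The returned function has the array-in, array-out signature required of a right-hand side for a system, so it can be passed directly to any of the system solvers discussed in this section.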


Worked Examples: Higher-Order Equations 


EXAMPLE 7.28 The van der Pol Equation 


A famous, nonlinear second-order differential equation is the van der Pol equation:

$$x'' + \epsilon(x^2 - 1)x' + x = 0,$$

where $\epsilon$ is a positive constant. One can think of this equation as a model for a
mass-spring-damper system with nonlinear damping. Since the damping coefficient,
$\epsilon(x^2 - 1)$, is positive for $|x| > 1$, energy is drained from the system when the
amplitude of the motion is large, which tends to decrease the amplitude of the
motion; however, for $|x| < 1$, the damping term becomes negative, implying that
energy is supplied to the system. It seems reasonable, therefore, to expect that the
solution will exhibit some sort of nonlinear oscillatory, or limit cycle, behavior.


Figure 7.18 Solution to van der Pol equation with $\epsilon = 4$. (Panels: Component Graphs; Phase Portrait. Axis label: Position, x.)


We would like to investigate this behavior numerically. To accomplish this, 
we first convert the second-order differential equation into a system of first-order 
equations. Define the intermediate variables 


$$u_1 = x$$
$$u_2 = x';$$

the resulting system then takes the form

$$u_1' = u_2$$
$$u_2' = \epsilon(1 - u_1^2)u_2 - u_1.$$

Let's take $\epsilon = 4$ and apply the initial conditions $x(0) = 0.75$, $x'(0) = 0$. Using the
RKF45 method with a minimum allowed step size of 0.01, a maximum step size 
of 0.5, an error tolerance of 0.0001 per unit step, and the infinity norm to measure 
error estimates, the results shown in Figure 7.18 are obtained. The limit cycle 
behavior of the solution is clear. From the component graphs, the period of the 
nonlinear oscillation appears to be slightly more than 10 time units. In the phase 
portrait, motion is clockwise around the limit cycle. 
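
The period estimate can be reproduced approximately with a simple experiment. The sketch below is not the RKF45 computation described above: it assumes a fixed-step classical RK4 integrator with a hypothetical step size $h = 0.005$, and it estimates the period as the time between the last two upward zero crossings of $u_1$, located by linear interpolation. All function names are illustrative.

```python
def van_der_pol(eps):
    """First-order form: u1' = u2, u2' = eps*(1 - u1^2)*u2 - u1."""
    def f(t, u):
        return [u[1], eps * (1.0 - u[0] ** 2) * u[1] - u[0]]
    return f


def rk4_step(f, t, w, h):
    """One classical RK4 step for a system stored as a list."""
    k1 = f(t, w)
    k2 = f(t + h / 2, [wj + h / 2 * kj for wj, kj in zip(w, k1)])
    k3 = f(t + h / 2, [wj + h / 2 * kj for wj, kj in zip(w, k2)])
    k4 = f(t + h, [wj + h * kj for wj, kj in zip(w, k3)])
    return [wj + h / 6 * (p + 2 * q + 2 * r + s)
            for wj, p, q, r, s in zip(w, k1, k2, k3, k4)]


def estimate_period(eps=4.0, x0=0.75, v0=0.0, h=0.005, t_end=80.0):
    """Time between the last two upward zero crossings of the position."""
    f = van_der_pol(eps)
    w = [x0, v0]
    crossings = []
    for i in range(int(round(t_end / h))):
        t = i * h
        w_new = rk4_step(f, t, w, h)
        if w[0] < 0.0 <= w_new[0]:
            # Linear interpolation for the crossing time within the step.
            crossings.append(t + h * (-w[0]) / (w_new[0] - w[0]))
        w = w_new
    if len(crossings) < 2:
        raise ValueError("integrate longer to capture two crossings")
    return crossings[-1] - crossings[-2]


period = estimate_period()
```

Using the last two crossings, rather than the first two, lets the initial transient die out before the period is measured.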


EXAMPLE 7.29 Bubble Dynamics 


Lithotripsy is a medical procedure whereby brief, intense ultrasound pulses are used 
to break kidney stones into small enough pieces that can be passed naturally. The 
lithotripsy pulses induce cavitation, and the oscillations in the radius of the resulting 
bubbles play a role in the breakdown of the kidney stones. Howle, Schaeffer,
Shearer, and Zhong (“Lithotripsy: The Treatment of Kidney Stones with Shock
Waves,” SIAM Review, pp. 356-371, 1998) investigate the free vibrations of these
cavitation-induced air bubbles. The model they consider is the second-order initial
value problem

$$R\ddot{R} + \frac{3}{2}\dot{R}^2 = a^2\left[\left(\frac{R_0}{R}\right)^{3\gamma} - 1\right], \qquad R(0) = AR_0, \quad \dot{R}(0) = 0,$$

where dots denote differentiation with respect to time. $R$ is the time-varying radius
of the air bubble, $R_0 = 3 \times 10^{-6}$ meters is the equilibrium radius of the bubble,
$\gamma = 1.4$ is the adiabatic exponent, and $a = \sqrt{p_{atm}/\rho} \approx 10$ m/s.

Note the large difference in order of magnitude between the value of Rg and 
that of a. Whenever system parameters vary in size by many orders of magnitude, 
it is best to nondimensionalize the variables, taking into account the natural scales 
of the problem, before running simulations. For this model, Rp is the most obvious 
choice for the length scale. As for the time scale, note that Ro/a has dimensions of 
time. We will therefore introduce the nondimensional radius and nondimensional 
time variables 


$$r = R/R_0 \quad \text{and} \quad \tau = (a/R_0)t$$


into the original initial value problem, producing

$$rr'' + \frac{3}{2}(r')^2 = r^{-3\gamma} - 1, \qquad r(0) = A, \quad r'(0) = 0,$$

where primes denote differentiation with respect to the variable $\tau$.
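
The substitution can be verified with the chain rule. The following is a sketch of that step, assuming the dimensional equation has the form $R\ddot{R} + \frac{3}{2}\dot{R}^2 = a^2[(R_0/R)^{3\gamma} - 1]$ and using only the scalings $r = R/R_0$ and $\tau = (a/R_0)t$:

```latex
\begin{align*}
R &= R_0\, r, \qquad \frac{d}{dt} = \frac{a}{R_0}\frac{d}{d\tau}
  \quad\Longrightarrow\quad
  \dot{R} = a\, r', \qquad \ddot{R} = \frac{a^2}{R_0}\, r'', \\
R\ddot{R} + \tfrac{3}{2}\dot{R}^2
  &= a^2\left[ r r'' + \tfrac{3}{2}(r')^2 \right], \qquad
  a^2\left[\left(\frac{R_0}{R}\right)^{3\gamma} - 1\right]
  = a^2\left[ r^{-3\gamma} - 1 \right].
\end{align*}
```

Dividing both sides by $a^2$ removes every dimensional parameter from the differential equation, and an initial radius of $A R_0$ becomes $r(0) = A$.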
To solve this initial value problem, we next introduce the intermediate vari- 
ables 
$$y_1 = r \quad \text{and} \quad y_2 = dr/d\tau,$$

which transform the second-order differential equation into the system of first-order
equations:

$$y_1' = y_2, \qquad y_1(0) = A$$
$$y_2' = \left(y_1^{-3\gamma} - 1 - 1.5y_2^2\right)/y_1, \qquad y_2(0) = 0.$$

Take $A = 2.5$. Using the RKV56 method with a minimum step size of 0.0001, a
maximum step size of 0.05, and a tolerance of 0.00001 per unit step produces the
results summarized in Figure 7.19. The period of the oscillation is roughly 4.7 time
units, which translates to a dimensional period of approximately 1.4 microseconds:
$(4.7)(3 \times 10^{-6}\ \text{m})/(10\ \text{m/s})$.


EXERCISES 


1. Advance the solution of each of the following systems through N = 4 time steps 
using Euler’s method and a step size of h = 0.25. 
(a) z =y, y =-2-2y, 2(0)=2, y(0) = -1 


Figure 7.19 Radius of cavitation-induced bubbles during lithotripsy. (Axis label: Nondimensional radius, r.)


(b) a =ay-1, y =2-y*, 2(0)=1, y(0)=0 
(c) 2 =22+4+2z, y =Qytz, 2? =z, 2(0)=1, y(0)=—-1, 2(0)=1 
(d) @ =y-2, y'=32-y-az, 2’ =a2y-z, 2(0)=y(0)=2(0) =1 


2. Repeat Exercise 1 using the second-order Runge-Kutta method of your choice.

3. Repeat Exercise 1 using the classical fourth-order Runge-Kutta method.

4. Repeat Exercise 1 using the two-step Adams-Bashforth method.

5. Repeat Exercise 1 using the four-step Adams-Bashforth method and N = 6 time
steps.

6. Repeat Exercise 1 using the Adams fourth-order predictor-corrector and N = 6
time steps.

7. Convert each of the following nth-order differential equations to a system of
first-order equations.

(a) 2!” + 4x" + 52’ =0 
(b) 2 =a 

(c) gill! 4 saz" =0 

(d) 2!” =2' In(x") + sin() 
(e) 2” +sin(zz") =1 


(f) 2 = cos(2’”’) — Vax” 


8. Convert each of the following to a system of first-order differential equations.


(a) oc’ +sin(c)=1—2', y +sinfy)=1—2, 2" tz) +z=a2'4+y’ 
(b) 2” = (2')? +n-sin(t), y” = /9-ty’ 


(c) o =y—a, yy’ =x2-Qytz, 2 =y-—2z 


9. Advance the solution of each of the following systems through N = 4 time steps
using Euler's method and a step size of h = 0.25.
(a) 2! + $22" =0, 2(0) =0,2'(0) = 0,2"(0) = 1 


(b) $x''' + 4x'' + 5x' = 0$, $x(0) = 1$, $x'(0) = 0$, $x''(0) = -1$

(c) $x'' + (x^2 - 1)x' + x = 0$, $x(0) = 0.5$, $x'(0) = 0.1$

(d) $x'' + 4x' + x = \sin(t)$, $x(0) = 1$, $x'(0) = 0$

10. Repeat Exercise 9 using the second-order Runge-Kutta method of your choice.

11. Repeat Exercise 9 using the classical fourth-order Runge-Kutta method.

12. Repeat Exercise 9 using the two-step Adams-Bashforth method.

13. Repeat Exercise 9 using the four-step Adams-Bashforth method and N = 6 time
steps.

14. Repeat Exercise 9 using the Adams fourth-order predictor-corrector and N = 6
time steps.

15. To help understand the long term effects of stress on plants and animals, the
following food-limited population model has been proposed (Nisbet, McCauley,
Gurney, Murdoch, and de Roos, “Simple Representations of Biomass Dynamics
in Structured Populations,” in Case Studies in Mathematical Modeling:
Ecology, Physiology and Cell Biology, Othmer, Adler, Lewis and Dallon, eds.,
Prentice Hall, 1997, pp. 61-79):

$$\frac{dF}{dt} = \Phi - \frac{I_{\max} F C}{F_h + F}$$
$$\frac{dC}{dt} = \frac{\epsilon I_{\max} F C}{F_h + F} - (m + b)C.$$

Here, $F(t)$ denotes the food biomass density, and $C(t)$ denotes the herbivore
(consumer) biomass density. $\Phi$ represents the net rate of food supply to the
model. The various model parameters include

$I_{\max} = 6.5$ day$^{-1}$     maximum specific ingestion rate;
$F_h = 0.98$ mg/L               half-saturation constant;
$\epsilon = 0.75$               assimilation efficiency of herbivore;
$m = 0.04$ day$^{-1}$           death rate of herbivore; and
$b = 0.23$ day$^{-1}$           respiration rate of herbivore.

Simulate 30 days of the biomass dynamics for a constant food supply rate of
$\Phi = 0.125$ mg/L·day. Take $F(0) = 0$ and $C(0) = 0.1$.

16. S. Rinaldi (“Laura and Petrarch: An Intriguing Case of Cyclical Love Dynamics,”
SIAM J. Appl. Math., 58, pp. 1205-1221, 1998) presents the following model
for the emotional and inspirational cycle of the fourteenth-century Italian poet
Petrarch:

$$\frac{dL}{dt} = -3.6L + 1.2\left[P(1 - P^2) - 1\right]$$
$$\frac{dP}{dt} = -1.2P + 6\left(L + \frac{2}{1 + Z}\right)$$
$$\frac{dZ}{dt} = -0.12Z + 12P.$$

Here, $L$ represents the love of Laura (a beautiful woman who was Petrarch's
inspiration) for Petrarch, $P$ represents the magnitude of Petrarch's love for Laura,
and $Z$ represents the poet's inspiration level. Time is measured in years. Starting
from the initial conditions

$$L(0) = P(0) = Z(0) = 0,$$

simulate 21 years of Petrarch's emotional cycle. Display your results as functions
of time and in the $P$-$L$ and $Z$-$P$ phase planes.


17. Neglecting the effect of air resistance, the motion of a pendulum can be modeled
by the second-order initial value problem

$$L\theta'' + g\sin\theta = 0, \qquad \theta(0) = \theta_0, \quad \theta'(0) = 0,$$

where $\theta$ denotes the angle which the pendulum rod makes with the vertical, $L$
is the length of the pendulum rod, and $g$ is the acceleration due to gravity. For
this problem, take $L = 1$ meter and $g = 9.8$ m/s$^2$. Estimate the period of the
pendulum for values of $\theta_0$ ranging from 0.1 through 2.0 in increments of 0.1.


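18. Recall the catalyzed reaction problem considered earlier in this section:

$$\alpha' = -\alpha(\lambda - \beta + \alpha), \qquad \alpha(0) = 1$$
$$\beta' = -\kappa\beta(\beta - \alpha), \qquad \beta(0) = 1.$$

Fix $\lambda = 1$, and investigate the effect of $\kappa$ on the reaction dynamics. Consider
$\kappa = 10$, 1, 0.1, and 0.01.
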
19. Recall the van der Pol equation considered earlier in this section:

$$x'' + \epsilon(x^2 - 1)x' + x = 0.$$

Using the initial conditions $x(0) = 0.75$ and $x'(0) = 0$, estimate the period of
the motion for values of $\epsilon$ ranging from 2.0 through 6.0 in increments of 0.25.

20. Howle, Schaeffer, Shearer, and Zhong (“Lithotripsy: The Treatment of Kidney
Stones with Shock Waves,” SIAM Review, pp. 356-371, 1998) also investigate
the direct effect of ultrasound pulses on the cavitation-induced air bubbles. In
this case, the model they consider is the second-order initial value problem

$$\frac{1}{a^2}\left[R\ddot{R} + \frac{3}{2}\dot{R}^2\right] = \left(\frac{R_0}{R}\right)^{3\gamma} - p_\infty(t), \qquad R(0) = R_0, \quad \dot{R}(0) = 0,$$

where dots denote differentiation with respect to time and

$$p_\infty(t) = \begin{cases} 2000e^{-t/\tau_1}\cos(t/\tau_2 + \pi/3) + 1, & 0 \le t \le (7\pi/6)\tau_2 \\ 1, & \text{otherwise.} \end{cases}$$

Here, $R$ is the time-varying radius of the air bubble, $R_0 = 3 \times 10^{-6}$ meters is
the equilibrium radius of the bubble, $\gamma = 1.4$ is the adiabatic exponent, and
$a = \sqrt{p_{atm}/\rho} \approx 10$ m/s. The time constants in $p_\infty(t)$ are $\tau_1 = 1.1 \times 10^{-6}$
seconds and $\tau_2 = 1.9 \times 10^{-6}$ seconds. Estimate the period of the bubble motion.
Remember to introduce nondimensional radius and time variables, and don't
forget to nondimensionalize the time constants $\tau_1$ and $\tau_2$.

21. In a study of nonlinear spatial developments of two-dimensional wall jets on
curved surfaces, Le Cunff and Zebib (“Nonlinear Spatially Developing Görtler
Vortices in Curved Wall Jet Flow,” Phys. Fluids, 8, pp. 2375-2384, 1996) require
the solution of the initial value problem


mi ” 12 } u 95/2 
z + {lca +2(2)°)=0, 2(0)=2(0)=0,2 Ca 


Approximate the solution of this problem from ¢ = 0 to t = 20. 
22. In the Overview to this chapter (see page 534), we developed the model 


1 — ug 
Se eT ip = '(Q) = 
Ba ~S (eZ), ylo)= tu, v= 2 
for the growth of a population in a closed system. Here, y is the natural logarithm 
of the population, ug is the initial population, and « is a dimensionless parameter. 
Investigate the evolution of the population from tf = 0 tot = 5 for x = 0.05, 0.1, 
0.25, and 0.5. Take ug = 0.1. 


23. D. Winter (“On the Stem Curve of a Tall Palm in a Strong Wind,” SIAM Review, 
35, pp. 567-579, 1993) develops the following model for the stem curve of a palm 
tree subject to wind loading: 


sind + ? eab 


C8 Ws (Pa We 

ds? EI LW; EI 
& = sind 

dz 

a = 0088. 


The variables in the problem are the angle of the stem relative to the vertical posi- 
tion, 6, the arc length measured along the stem, s, the horizontal displacement of 
the stem, x, and height of a location along the stem, z. Both ¢ and z are treated 
as functions of s. The parameters are the total stem weight W, = 22700N, the 
Young’s modulus of the stem E = 6 x 10° N/m?, the moment of inertia of the 
stem J = 5.147 x 1074 m4, the length of the stem L = 30 m, the total canopy 
weight W. = 1385.5 N, and the wind drag force on the canopy D = 4.135U? N, 
where U is the wind speed in m/s. With initial conditions of 


6(0) = 6'(0) = 2(0) = 2(0) = 0, 


simulate the stem curve of a palm tree subject to a wind speed of 18 m/s. 


7.9 ABSOLUTE STABILITY AND STIFF EQUATIONS 
An Example to Motivate the Discussion 


A chemostat is a completely mixed, continuously stirred tank reactor used for grow- 
ing cells—such as yeast or bacteria—in a controlled environment. The chemostat 
consists of a large vessel with a mechanical stirring paddle. An excess of nutrients
is provided for the growth of the cells, with the exception of one controlled
medium, known as the substrate. A tube leading into the chemostat provides a
controlled flow of substrate, along with chemical compounds needed to maintain
an environment favorable to cell growth. A second tube, leading out of the vessel,
is used to siphon fluid from the chemostat so that cells can be harvested. The flow
rates into and out from the chemostat are held equal so as to maintain constant
volume within the vessel.

Boyd and Wang (“Optimizing Cell Production in a Chemostat,” in Proceed- 
ings of Mathematical Modeling in the Undergraduate Curriculum, H. Skala, editor, 
University of Wisconsin/La Crosse, 1994) present the following model for the dy- 
namics of a chemostat: 


$$\frac{ds}{dt} = 1 - s - \frac{msx}{a + s} \tag{1}$$
$$\frac{dx}{dt} = \frac{msx}{a + s} - x \tag{2}$$


Here, $s$ is the nondimensional mass of substrate within the chemostat per unit volume,
and $x$ is the nondimensional cell mass per unit volume. The parameters in
this system, m and a, denote the maximal cell growth rate (due to nutrient con- 
sumption) relative to the volumetric flow rate into and out from the chemostat and 
the half-saturation level relative to the inflow substrate concentration, respectively. 
The half-saturation level is the concentration of substrate at which the per-capita 
cell growth rate achieves half its maximal value. 

The chemostat equations, (1) and (2), possess two equilibrium solutions. One
equilibrium solution is located at the point (1, 0), and the other is located at

$$(s^*, x^*) = \left(\frac{a}{m - 1},\; 1 - \frac{a}{m - 1}\right).$$

Of course, the second equilibrium solution is physically meaningful only for $m > 1$
and $a < m - 1$. Suppose $m = 16$ and $a = 0.25$. Then

$$(s^*, x^*) = \left(\frac{1}{60},\; \frac{59}{60}\right).$$

Linear stability analysis about this point indicates that the solution is asymptotically
stable with characteristic exponents $\lambda_1 = -1$ and $\lambda_2 = -55.31$. This means
that near the point $(s^*, x^*)$, the solution of (1) and (2) behaves like

$$\begin{bmatrix} s(t) \\ x(t) \end{bmatrix} \approx \begin{bmatrix} s^* \\ x^* \end{bmatrix} + c_1 e^{-t}\mathbf{v}_1 + c_2 e^{-55.31t}\mathbf{v}_2$$

for some constants $c_1$ and $c_2$, where $\mathbf{v}_1$ and $\mathbf{v}_2$ are the eigenvectors associated with
the characteristic exponents.
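
The quoted characteristic exponents can be checked directly. The sketch below assumes the chemostat model reads $ds/dt = 1 - s - msx/(a+s)$, $dx/dt = msx/(a+s) - x$ (the form consistent with the equilibrium and exponents stated above); the function names are hypothetical. It evaluates the Jacobian at the nontrivial equilibrium and finds the eigenvalues of the resulting 2-by-2 matrix from its trace and determinant.

```python
import math


def chemostat_equilibrium(m, a):
    """Nontrivial equilibrium (s*, x*) of the assumed chemostat model."""
    s_star = a / (m - 1.0)
    return s_star, 1.0 - s_star


def chemostat_jacobian(m, a, s, x):
    """Jacobian of the right-hand side with growth term g(s) = m*s/(a + s)."""
    g = m * s / (a + s)
    dg = m * a / (a + s) ** 2  # derivative of g with respect to s
    return [[-1.0 - dg * x, -g],
            [dg * x, g - 1.0]]


def eigenvalues_2x2(J):
    """Real eigenvalues of a 2x2 matrix via the quadratic formula."""
    trace = J[0][0] + J[1][1]
    det = J[0][0] * J[1][1] - J[0][1] * J[1][0]
    disc = trace * trace - 4.0 * det
    if disc < 0.0:
        raise ValueError("eigenvalues are complex")
    root = math.sqrt(disc)
    return (trace + root) / 2.0, (trace - root) / 2.0


s_star, x_star = chemostat_equilibrium(16.0, 0.25)
lambda_1, lambda_2 = eigenvalues_2x2(chemostat_jacobian(16.0, 0.25, s_star, x_star))
```

For $m = 16$ and $a = 0.25$ this gives exponents of about $-1$ and $-55.31$, so the two decay time scales differ by a factor of more than fifty.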

The top graph of Figure 7.20 displays the approximate solution to equa- 
tions (1) and (2) with m = 16 and a = 0.25 and initial conditions s(0) = 0.5, 
and x(0) = 0.02. The solution was computed using the RKF45 method with

Figure 7.20 Solution of equations (1) and (2) with m = 16 and a =
0.25 obtained using the RKF45 method. (Top graph) Time evolution of 
substrate and cell mass concentrations within the chemostat. (Bottom 
graph) Evolution of step sizes used to compute approximate solution. 


hmin = 0.001, Amax = 0.5, and TOL = 5x 107". As expected, the solution exhibits 
fairly rapid convergence toward equilibrium. What is not expected, however, is 
the time evolution of the step sizes used to obtain the approximate solution, as 
displayed in the bottom graph of Figure 7.20. For t > 2, both components of the 
solution remain essentially constant, with changes in the solution occurring on a 
time scale on the order of 1, the reciprocal of the absolute value of the less negative
characteristic exponent $\lambda_1$. Despite the observed behavior in the solution, the
time steps selected by the RKF45 method hover around the 0.04-0.06 range, which
is roughly on the same order as the time scale associated with the more negative
characteristic exponent $\lambda_2$ ($1/55.31 \approx 0.018$).

The results contained in Figure 7.21 are equally unexpected. The top graph 
shows the approximate solution of the chemostat equations computed using the 
classical fourth-order Runge-Kutta (RK4) method with a step size of h = 0.05. The 
parameter values and initial conditions are the same as above: m = 16, a = 0.25, 
s(0) = 0.5, and x(0) = 0.02. The bottom graph shows the solution computed
with the RK4 method and a time step of h = 0.06. The solution computed with 
h = 0.05 is essentially identical to the solution obtained with the RKF45 method. 
With h = 0.06, however, the approximate solution initially evolves much like the 
previous solutions, but sometime after t = 2 breaks down and settles onto a long-term
behavior which is quite different from the asymptotic nature of the exact
solution.

Figure 7.21 Solution of equations (1) and (2) with m = 16 and a =
0.25 obtained using the RK4 method with h = 0.05 (top graph) and
h = 0.06 (bottom graph).


Figures 7.20 and 7.21 raise several important questions. Why does the RKF45
method select such small step sizes to approximate a nearly constant solution? 
Furthermore, why is there so much variation in the selected step sizes? With 
regard to the RK4 method, why did such a small change in the step size cause 
the approximate solution to fail to maintain the asymptotic character of the exact 
solution? Finally, why do the step sizes required by these two methods to obtain 
accurate approximations appear to be determined by the most rapidly decaying 
component of the solution? As we will now establish, the answers to these questions 
are related to the ability of the difference equations which define the numerical 
methods to approximate decaying exponentials. 


Test Problem 


To investigate the issues raised by Figures 7.20 and 7.21, consider the standard test
problem

$$y' = \lambda y,$$

where $\lambda$ is complex valued with $\mathrm{Re}(\lambda) < 0$. This is the simplest differential equation
that produces exponential solutions, and the real part of the parameter $\lambda$ completely
controls the time scale of the decay in the solution. We allow $\lambda$ to be complex valued
in order for our analysis to be applicable to systems of differential equations, in
which case $\lambda$ would represent an eigenvalue of the Jacobian associated with the
right-hand side of the system. Recall that the Jacobian of a vector-valued function
is a matrix of partial derivatives (see Section 3.10) and that the eigenvalues of a
real matrix can be complex.

The next step is to apply the numerical methods developed earlier in this 
chapter to the test problem. The objective is to determine what restrictions, if any, 
are placed on the step size h to guarantee that the asymptotic character of the 
approximate solution matches that of the analytical solution. Since the analytical 
solution of the test problem decays to zero as $t \to \infty$, we will want the approximate
solution of the test problem to decay to zero as $n \to \infty$. When the asymptotic
character of the approximate solution produced by a given numerical method matches
that of the analytical solution, the numerical method is said to be absolutely stable. 


One-Step Methods 


The one-step methods introduced in Sections 7.2, 7.3, and 7.4 fell into two cate- 
gories: Taylor methods and Runge-Kutta methods. Our analysis of absolute sta- 
bility will be simplified, however, by the fact that the right-hand side of the test 
problem is linear in both the independent and the dependent variables. In this case, 
all of the Runge-Kutta methods reduce to the corresponding Taylor method of the 
same order. We can therefore restrict attention to the Taylor methods.

The mth-order Taylor method is defined by the difference equation

$$w_n = w_{n-1} + hf(t_{n-1}, w_{n-1}) + \frac{h^2}{2}\frac{df}{dt}(t_{n-1}, w_{n-1}) + \cdots + \frac{h^m}{m!}\frac{d^{m-1}f}{dt^{m-1}}(t_{n-1}, w_{n-1}).$$

For the test problem under consideration, $f(t, y) = \lambda y$, so

$$\frac{d^k f}{dt^k} = \frac{d^k(\lambda y)}{dt^k} = \lambda\frac{d^k y}{dt^k}.$$

Since $y' = \lambda y$, it is straightforward to establish that $y^{(k)} = \lambda^k y$, and therefore

$$\frac{d^k f}{dt^k} = \lambda^{k+1} y.$$

Substituting this last expression into the equation for the Taylor method yields

$$w_n = w_{n-1} + h\lambda w_{n-1} + \frac{1}{2}h^2\lambda^2 w_{n-1} + \cdots + \frac{1}{m!}h^m\lambda^m w_{n-1}.$$

Now, let

$$Q(z) = 1 + z + \frac{1}{2}z^2 + \cdots + \frac{1}{m!}z^m.$$

Then the Taylor method applied to the test problem becomes

$$w_n = [Q(h\lambda)]\, w_{n-1}. \tag{3}$$

Figure 7.22 Regions of absolute stability for the Taylor/Runge-Kutta
methods of order one (Euler's method) through four.


The solution to the elementary difference equation, (3), is given by

$$w_n = [Q(h\lambda)]^n w_0;$$

hence, $w_n$ will decay toward zero with increasing $n$, matching the asymptotic
character of the exact solution of the test problem, provided that $|Q(h\lambda)| < 1$. The set
of all values of $h\lambda$ for which $|Q(h\lambda)| < 1$ is called the region of absolute stability
for the given one-step method.


Definition. The REGION OF ABSOLUTE STABILITY for a one-step method
is the set

$$R = \{h\lambda \in \mathbb{C} \mid |Q(h\lambda)| < 1\}.$$

Figure 7.22 displays the region of absolute stability associated with the order 
one through four Taylor/Runge-Kutta methods. Remember that Euler’s method 
is the first-order Taylor method. Two observations can be made from this figure. 
First, the regions of absolute stability grow with increasing order of the method. 
This should not be surprising considering that with increasing order, more terms in 
the Taylor expansion of the exact solution are retained, so the method provides a 
better approximation to the exponential function. The second observation is that 
the regions are of finite size. Thus, for a given A, absolute stability considera- 
tions will place an upper bound on the step size that can be used to compute the 
approximate solution. 
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How does this information help to explain the phenomena that were observed 
in the numerical solution of the chemostat problem at the beginning of the section? 
Consider a fourth-order Runge-Kutta method, and suppose that A is real. For a 
fourth-order Runge-Kutta method, 


2 Date Eine «lcd 
Q(z) =l+2z+5z + ge toe 


so absolute stability requires 
1 1 1 
1+hA+ =(hA)? + =(hA)® + —(Ad)4| <1. 
+ 5( ) ar ) + 37 (ha) <1 
The solution of the above inequality is 


—2.78538 < hA <0, 


meaning that we must choose $h < -2.7853/\lambda$. Recall that there were two different
values of $\lambda$ for the chemostat problem. Since $h$ is inversely proportional to $\lambda$, it
follows that the upper bound on $h$ will be derived from $\lambda_2 = -55.31$, which is larger
in magnitude than $\lambda_1 = -1$. This state of affairs will hold true in general. When
a problem possesses multiple time scales, the product $h\lambda$ must lie in the region of
absolute stability for all values of $\lambda$. The maximum allowable time step based on
absolute stability considerations will therefore be dictated by the $\lambda$ that is largest
in magnitude. Returning to the specifics of the chemostat problem, we see that
absolute stability requires

$$h < \frac{-2.7853}{\lambda_2} \approx 0.05036 = h_c.$$

Now reexamine Figure 7.21. With $h = 0.05 < h_c$, the RK4 method is absolutely
stable, but with $h = 0.06 > h_c$, the method is not absolutely stable. The
breakdown in the approximation to $e^{\lambda_2 t}$ for $h = 0.06$ causes the approximate
solution to exhibit an asymptotic behavior different from that of the exact solution.
As for the time steps selected by the RKF45 method for $t > 2$ (bottom graph
of Figure 7.20), these can be explained as follows. When $h < h_c$, the underlying
methods are absolutely stable, so estimates of the local truncation error result in
the selection of a larger step size. Eventually, $h$ grows larger than $h_c$, and absolute
stability is lost. After a few time steps, the breakdown in the approximation to $e^{\lambda_2 t}$
produces large local truncation error estimates and forces the step size below $h_c$.
The cycle then repeats, generating oscillation about the cutoff value for absolute
stability.
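The cutoff step size for the chemostat problem can be reproduced numerically. The following is a minimal sketch that locates the left end of the real stability interval of the fourth-order polynomial $Q$ by bisection; the bracketing values $-3$ and $-2$ and the function names are choices made here, not taken from the text.

```python
def Q_rk4(z):
    """Amplification factor of a fourth-order Runge-Kutta method, z = h*lambda."""
    return 1 + z + z**2 / 2 + z**3 / 6 + z**4 / 24

def real_stability_limit(lo=-3.0, hi=-2.0, tol=1e-10):
    """Bisection for the negative real z at which |Q_rk4(z)| = 1.
    Assumes |Q_rk4(lo)| > 1 and |Q_rk4(hi)| < 1."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if abs(Q_rk4(mid)) < 1.0:
            hi = mid      # still stable: the boundary lies further to the left
        else:
            lo = mid
    return 0.5 * (lo + hi)

z_star = real_stability_limit()
lambda_2 = -55.31                 # fast chemostat exponent quoted in the text
h_max = z_star / lambda_2         # largest step size allowed by absolute stability

print(round(z_star, 4), round(h_max, 5))
for h in (0.05, 0.06):
    print(h, abs(Q_rk4(h * lambda_2)) < 1.0)   # stable for 0.05, unstable for 0.06
```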


Multistep Methods 
Recall that the general linear m-step multistep method is of the form 


$$\frac{w_{i+1} - a_1 w_i - a_2 w_{i-1} - \cdots - a_m w_{i+1-m}}{h}
  = b_0 f(t_{i+1}, w_{i+1}) + b_1 f(t_i, w_i) + b_2 f(t_{i-1}, w_{i-1}) + \cdots + b_m f(t_{i+1-m}, w_{i+1-m}). \qquad (4)$$

Applying equation (4) to the test problem $y' = \lambda y$ and collecting like terms leads
to

$$(1 - b_0 h\lambda) w_{i+1} - (a_1 + b_1 h\lambda) w_i - (a_2 + b_2 h\lambda) w_{i-1} - \cdots - (a_m + b_m h\lambda) w_{i+1-m} = 0. \qquad (5)$$


The solution of this linear difference equation is related to the roots of the corre- 
sponding characteristic polynomial 


$$Q(z, h\lambda) = (1 - b_0 h\lambda) z^m - (a_1 + b_1 h\lambda) z^{m-1} - (a_2 + b_2 h\lambda) z^{m-2} - \cdots - (a_m + b_m h\lambda).$$

In particular, suppose that $\beta_1(h\lambda), \beta_2(h\lambda), \beta_3(h\lambda), \ldots, \beta_m(h\lambda)$ are the roots of
$Q(z, h\lambda)$. If the $\beta_k$ are all distinct, then the solution of (5) can be written as

$$w_n = \sum_{k=1}^{m} c_k [\beta_k(h\lambda)]^n,$$

where the coefficients $c_k$ are determined by the values of $w_0, w_1, w_2, \ldots, w_{m-1}$.
When the roots are not distinct, the form of the solution changes somewhat but
will still depend on powers of each of the roots of $Q(z, h\lambda)$. Thus, it is clear that
the general multistep method will be absolutely stable only when $|\beta_k(h\lambda)| < 1$ for
each $k$.


Definition. The REGION OF ABSOLUTE STABILITY for a multistep method
is the set

$$R = \{ h\lambda \in \mathbb{C} \;:\; |\beta_k(h\lambda)| < 1 \text{ for each root of } Q(z, h\lambda) \}.$$
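A rough sketch of how this definition might be checked in code follows. It builds the coefficients of $Q(z, h\lambda)$ from the lists $a_1, \ldots, a_m$ and $b_0, \ldots, b_m$ of equation (4) and tests the modulus of every root. The root finder is a plain Durand-Kerner iteration, chosen here only to keep the example self-contained; it is not something the text prescribes, and all function names are illustrative.

```python
def char_poly(a, b, z):
    """Coefficients (highest power first) of Q(., h*lambda) for the m-step method
    with a = [a_1, ..., a_m] and b = [b_0, ..., b_m], evaluated at z = h*lambda."""
    return [1 - b[0] * z] + [-(a[k] + b[k + 1] * z) for k in range(len(a))]

def poly_roots(coeffs, sweeps=200):
    """All roots of a polynomial by Durand-Kerner iteration (assumes a nonzero
    leading coefficient and simple roots)."""
    monic = [c / coeffs[0] for c in coeffs]
    m = len(monic) - 1
    roots = [(0.4 + 0.9j) ** k for k in range(m)]   # customary distinct starting guesses
    for _ in range(sweeps):
        for i in range(m):
            value = 0j
            for c in monic:                         # Horner evaluation at roots[i]
                value = value * roots[i] + c
            denom = 1 + 0j
            for j in range(m):
                if j != i:
                    denom *= roots[i] - roots[j]
            roots[i] -= value / denom
    return roots

def is_absolutely_stable(a, b, z):
    """True when every root of Q(., z) has modulus less than one."""
    return all(abs(r) < 1.0 for r in poly_roots(char_poly(a, b, z)))

# Two-step Adams-Bashforth coefficients in the notation of equation (4).
AB2_A, AB2_B = [1.0, 0.0], [0.0, 1.5, -0.5]
print(is_absolutely_stable(AB2_A, AB2_B, -0.5))
print(is_absolutely_stable(AB2_A, AB2_B, -1.5))
```

Sweeping `is_absolutely_stable` over a grid of complex $h\lambda$ values gives a brute-force picture of the region for any method written in the form of equation (4).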


EXAMPLE 7.30 Two Characteristic Polynomials and Their Roots
Recall that the two-step Adams-Bashforth method is defined by the difference
equation

$$\frac{w_{n+1} - w_n}{h} = \frac{3}{2} f(t_n, w_n) - \frac{1}{2} f(t_{n-1}, w_{n-1}).$$

Applying this method to the test problem $y' = \lambda y$ leads to, after collecting like
terms,

$$w_{n+1} - \left( 1 + \frac{3}{2} h\lambda \right) w_n + \frac{1}{2} h\lambda\, w_{n-1} = 0.$$

The characteristic polynomial associated with this difference equation is

$$Q(z, h\lambda) = z^2 - \left( 1 + \frac{3}{2} h\lambda \right) z + \frac{1}{2} h\lambda,$$

whose roots can be obtained from the quadratic formula:

$$\beta_1(h\lambda) = \frac{1}{2} \left[ \left( 1 + \frac{3}{2} h\lambda \right) + \sqrt{1 + h\lambda + \frac{9}{4} (h\lambda)^2} \right],$$

$$\beta_2(h\lambda) = \frac{1}{2} \left[ \left( 1 + \frac{3}{2} h\lambda \right) - \sqrt{1 + h\lambda + \frac{9}{4} (h\lambda)^2} \right].$$

The set of $h\lambda$ values for which both of these roots have magnitude less than unity
defines the region of absolute stability for this numerical method (see the upper
left panel of Figure 7.23).

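As a second illustration, consider the two-step Adams-Moulton method:

$$\frac{w_{n+1} - w_n}{h} = \frac{5}{12} f(t_{n+1}, w_{n+1}) + \frac{2}{3} f(t_n, w_n) - \frac{1}{12} f(t_{n-1}, w_{n-1}).$$

The characteristic polynomial associated with this difference equation is

$$Q(z, h\lambda) = \left( 1 - \frac{5}{12} h\lambda \right) z^2 - \left( 1 + \frac{2}{3} h\lambda \right) z + \frac{1}{12} h\lambda,$$

whose roots can again be obtained from the quadratic formula:

$$\beta_1(h\lambda) = \frac{\left( 1 + \frac{2}{3} h\lambda \right) + \sqrt{1 + h\lambda + \frac{7}{12} (h\lambda)^2}}{2 \left( 1 - \frac{5}{12} h\lambda \right)},$$

$$\beta_2(h\lambda) = \frac{\left( 1 + \frac{2}{3} h\lambda \right) - \sqrt{1 + h\lambda + \frac{7}{12} (h\lambda)^2}}{2 \left( 1 - \frac{5}{12} h\lambda \right)}.$$

The region of absolute stability determined by these roots is plotted in the upper
right panel of Figure 7.23.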


Figure 7.23 displays the region of absolute stability for the two- and three-step 
Adams-Bashforth and Adams-Moulton methods. Two important observations can 
be made from this figure. First, the regions of absolute stability for the implicit 
methods (the Adams-Moulton methods) are significantly larger than the regions 
for their explicit counterparts. Second, unlike the Taylor/Runge-Kutta methods 
for which the regions of absolute stability increase in size with increasing order of 
approximation, here we see that the regions decrease in size with increasing order of 
approximation. Thus higher-order Adams-Bashforth and Adams-Moulton methods 
will require smaller values of h to maintain absolute stability. 
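The first observation can be spot-checked on the real axis for the two-step methods of Example 7.30. The sketch below bisects for the left endpoint of the real stability interval of each method, using the two characteristic polynomials given in that example; the bracketing values and function names are assumptions made for this illustration.

```python
import cmath

def max_root_modulus(c2, c1, c0):
    """Largest modulus among the roots of c2*z**2 + c1*z + c0."""
    disc = cmath.sqrt(c1 * c1 - 4 * c2 * c0)
    return max(abs((-c1 + disc) / (2 * c2)), abs((-c1 - disc) / (2 * c2)))

def ab2(x):
    """Two-step Adams-Bashforth, x = h*lambda."""
    return max_root_modulus(1.0, -(1 + 1.5 * x), 0.5 * x)

def am2(x):
    """Two-step Adams-Moulton, x = h*lambda."""
    return max_root_modulus(1 - 5 * x / 12, -(1 + 2 * x / 3), x / 12)

def left_endpoint(modulus, stable, unstable, tol=1e-10):
    """Bisection between a real stable and a real unstable value of h*lambda."""
    while stable - unstable > tol:
        mid = 0.5 * (stable + unstable)
        if modulus(mid) < 1.0:
            stable = mid
        else:
            unstable = mid
    return 0.5 * (stable + unstable)

print(left_endpoint(ab2, -0.5, -2.0))    # close to -1
print(left_endpoint(am2, -3.0, -10.0))   # close to -6
```

For real $\lambda$, the implicit two-step method therefore tolerates a step size roughly six times larger than its explicit counterpart.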


Stiff Equations and A-stable Methods 


The chemostat problem used to open this section belongs to a large class of
differential equations that are called stiff. Loosely speaking, a stiff equation is one
for which the step size of certain numerical methods must be drastically decreased,
over at least a portion of the domain, to maintain absolute stability. The analysis
we have just completed tells us that the mechanism that gives rise to stiffness
is the presence of vastly different evolutionary time scales in the solution. Thus,
whenever a differential equation is modeling a process with multiple, widely varied
time scales, we must be on guard for stiffness. Electronics, weather prediction,
chemical kinetics, and mathematical biology are common sources of stiff differential
equations.

Figure 7.23 Regions of absolute stability for the two- and three-step
Adams-Bashforth (left panels) and Adams-Moulton methods (right panels).


Suppose $\{\lambda_k\}$ is the set of characteristic exponents associated with a particular
differential equation. The quantity

$$S = \frac{\max_k |\mathrm{Re}\, \lambda_k|}{\min_k |\mathrm{Re}\, \lambda_k|}$$

is called the stiffness ratio and is often quoted as a measure of the degree of stiffness
inherent in a given problem. The stiffness ratio for the chemostat problem is $S = 55.31$.
As stiffness ratios go, this value is quite tame. Problems in chemical kinetics
routinely involve ratios on the order of 10'’, and a model of the Big Bang has a 
ratio on the order of 1077. 
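As a small illustration, the stiffness ratio can be computed directly from a list of characteristic exponents. The helper name below is hypothetical; the two sample lists use the chemostat exponents quoted earlier and the exponents $-1$ and $-1000$ of a problem with time scales 1 and 1/1000.

```python
def stiffness_ratio(exponents):
    """max_k |Re lambda_k| / min_k |Re lambda_k| for the given characteristic exponents."""
    sizes = [abs(complex(lam).real) for lam in exponents]
    return max(sizes) / min(sizes)

print(stiffness_ratio([-1.0, -55.31]))    # chemostat problem: 55.31
print(stiffness_ratio([-1.0, -1000.0]))   # time scales 1 and 1/1000: 1000.0
```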

Since the techniques we have examined thus far have such small regions of ab- 
solute stability, the application of these techniques to stiff problems is not advised. 
Even the variable step size algorithms are poor choices. Though the variable step 
size methods will automatically select an appropriate step size, these step sizes may 
become so ridiculously small as to make the total computational cost prohibitive. 
Further, when exceedingly small step sizes are used, the propagation of roundoff 
error can become a significant problem. The bottom line is that to approximate effi- 
ciently the solution of stiff problems, we need methods that have regions of absolute 
stability that are as large as possible. The ideal situation would be for the region 
of absolute stability to contain the entire left-hand side of the complex plane, as 
then the step size could be selected solely on the basis of accuracy, and not absolute 
stability. Such methods do exist and are called A-stable.

Definition. A numerical method is A-STABLE if it is absolutely stable for
all $h\lambda$ such that $\mathrm{Re}\, h\lambda < 0$.


The Trapezoidal Method


The only A-stable method among those introduced in the previous sections of this 
chapter is the trapezoidal method: 


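$$\frac{w_{n+1} - w_n}{h} = \frac{1}{2} \left[ f(t_{n+1}, w_{n+1}) + f(t_n, w_n) \right]. \qquad (6)$$

Recall that this is a second-order, implicit one-step method. When applied to the
test problem $y' = \lambda y$, equation (6) reduces to

$$w_{n+1} = \left[ \frac{2 + h\lambda}{2 - h\lambda} \right] w_n.$$

Thus

$$Q(h\lambda) = \frac{2 + h\lambda}{2 - h\lambda},$$

and the region of absolute stability for the trapezoidal method is given by

$$R = \left\{ h\lambda \in \mathbb{C} \;:\; \left| \frac{2 + h\lambda}{2 - h\lambda} \right| < 1 \right\}.$$

Since $|z| < 1$ if and only if $|z|^2 < 1$, we can continue our analysis by examining
$|Q(h\lambda)|^2$. Separating $h\lambda$ into real and imaginary parts, we find

$$\left| \frac{2 + h\lambda}{2 - h\lambda} \right|^2
  = \left| \frac{(2 + \mathrm{Re}\, h\lambda) + i\, \mathrm{Im}\, h\lambda}{(2 - \mathrm{Re}\, h\lambda) - i\, \mathrm{Im}\, h\lambda} \right|^2
  = \frac{(2 + \mathrm{Re}\, h\lambda)^2 + (\mathrm{Im}\, h\lambda)^2}{(2 - \mathrm{Re}\, h\lambda)^2 + (\mathrm{Im}\, h\lambda)^2}.$$

Next, set the last expression less than one, clear the fraction and expand both terms
involving the real part of $h\lambda$ to yield

$$4 + 4\, \mathrm{Re}\, h\lambda + (\mathrm{Re}\, h\lambda)^2 + (\mathrm{Im}\, h\lambda)^2 < 4 - 4\, \mathrm{Re}\, h\lambda + (\mathrm{Re}\, h\lambda)^2 + (\mathrm{Im}\, h\lambda)^2.$$

Some final simplification produces $\mathrm{Re}\, h\lambda < 0$. Hence, the region of absolute stability
for the trapezoidal method is the entire left-hand side of the complex plane, and
the method is A-stable.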
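The conclusion of this derivation can be spot-checked numerically. The sketch below evaluates $Q$ for the trapezoidal method on a grid of complex points and confirms that $|Q(h\lambda)| < 1$ exactly when $\mathrm{Re}\, h\lambda < 0$; the grid, its offsets and the function name are arbitrary choices made for this illustration.

```python
def Q_trapezoidal(z):
    """Amplification factor of the trapezoidal method, z = h*lambda."""
    return (2 + z) / (2 - z)

# Offset grid of sample points. The offsets keep the points off the imaginary
# axis (where |Q| = 1) and away from the pole of Q at z = 2.
samples = [complex(i / 4 + 0.1, j / 4) for i in range(-40, 41) for j in range(-40, 41)]

# |Q(z)| < 1 should hold precisely for the samples in the left half-plane.
consistent = all((abs(Q_trapezoidal(z)) < 1) == (z.real < 0) for z in samples)
print(consistent)
```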

For a linear differential equation, equation (6) can be solved explicitly for $w_{n+1}$
in terms of $w_n$, $t_n$, $t_{n+1}$ and $h$; however, for nonlinear problems, each time step of
the trapezoidal method will require the solution of an implicit equation for $w_{n+1}$.
Solving this implicit equation by functional iteration (e.g., in predictor-corrector
fashion) should not be attempted as this would impose a convergence condition on
the maximum size of $h$ with the same effect as a stability condition. Instead, a
rootfinding technique such as Newton’s method or the secant method should be
employed. With Newton’s method, it is customary to select the value of $w_n$ as
the initial guess for the value of $w_{n+1}$. The secant method, of course, requires two
starting values. $w_n$ can serve as one of those starting values, and the usual practice
is to obtain the second starting value from an explicit method. When working with
a system of differential equations, Newton’s method, Broyden’s method, or one of
the other quasi-Newton methods mentioned in Section 3.10 can be used.
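A minimal sketch of the Newton-based approach for a scalar problem is given below. The function `trapezoidal_newton`, its arguments, the tolerance and the iteration cap are all illustrative choices; the caller is assumed to supply the partial derivative of $f$ with respect to $y$. The test problem $y' = -y^2$, $y(0) = 1$ (exact solution $1/(1+t)$) is hypothetical and does not come from the text.

```python
def trapezoidal_newton(f, df_dy, t0, w0, h, steps, tol=1e-12, max_iter=50):
    """Advance w' = f(t, w) with the trapezoidal method, solving the implicit
    equation at each step by Newton's method (scalar problems only)."""
    t, w = t0, w0
    for _ in range(steps):
        t_next = t + h
        known = w + 0.5 * h * f(t, w)        # the part of equation (6) already known
        guess = w                             # customary initial guess: the current value
        for _iteration in range(max_iter):
            g = guess - known - 0.5 * h * f(t_next, guess)
            dg = 1.0 - 0.5 * h * df_dy(t_next, guess)
            delta = g / dg
            guess -= delta
            if abs(delta) < tol:
                break
        t, w = t_next, guess
    return w

# Hypothetical nonlinear test problem: y' = -y**2, y(0) = 1, integrated to t = 1.
approx = trapezoidal_newton(lambda t, y: -y * y, lambda t, y: -2.0 * y,
                            0.0, 1.0, 0.1, 10)
print(approx, 0.5)   # numerical value next to the exact value y(1) = 1/2
```

For systems, the scalar division by `dg` would be replaced by a linear solve with the Jacobian matrix, or by one of the quasi-Newton updates mentioned above.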

One major drawback to the trapezoidal method is that although the method is 
A-stable and hence will correctly predict the long-term behavior of the solution for 
any value of h, the method will generally fail to produce an accurate approximation 
to the initial behavior of the solution for “large” step sizes. Within this context,
“large” refers to any value of $h$ for which $h\lambda < -2$. When $h\lambda < -2$, $Q(h\lambda) < 0$
and a sawtooth-like oscillation is introduced into the approximate solution. The
good news is that since $|Q(h\lambda)| < 1$, the unwanted oscillations will eventually die
out. The bad news is that the more negative the value of $h\lambda$, the more negative
the value of $Q(h\lambda)$, which translates to an increase in the initial amplitude of the
oscillation, and the closer the value of $|Q(h\lambda)|$ becomes to one, which implies that
the oscillations will persist longer.


EXAMPLE 7.31 The Trapezoidal Method Applied to a Stiff Equation 


Consider the differential equation

$$y' = -1000y - e^{-t}, \qquad y(0) = 0.$$

The exact solution for this problem is

$$y(t) = \frac{1}{999} \left( e^{-1000t} - e^{-t} \right),$$

which clearly evolves with two markedly different time scales: 1 versus 1/1000. The
stiffness ratio for this problem is thus 1000.

The top graph in Figure 7.24 displays the exact solution to the differential
equation together with the approximate solution computed with the trapezoidal
method. A step size of $h = 1/600$ was used. Since the more negative characteristic
exponent associated with the solution is $\lambda = -1000$, it follows that $h\lambda = -5/3 > -2$.
For comparison, the approximate solution obtained from the second-order
Taylor method, with $h = 1/600$, is also plotted. Note that $h\lambda = -5/3$ falls inside
the region of absolute stability for the second-order Taylor method, which, for
real $\lambda$, is the interval $-2 < h\lambda < 0$. Both methods accurately reproduce the
asymptotic behavior of the exact solution. The trapezoidal method, however, does
a significantly better job approximating the initial, transient behavior.

Figure 7.24 Solution of $y' = -1000y - e^{-t}$, $y(0) = 0$ using the
trapezoidal method. (Top panel) Approximate solution, computed with
$h = 1/600$, versus the exact solution. Solution obtained from second-
order Taylor method included for comparison. (Bottom panel) Approximate
solutions (trapezoidal method) computed with $h\lambda = -2.5$,
$h\lambda = -10/3$, and $h\lambda = -5$.

The bottom graph of Figure 7.24 demonstrates the drawback to the trapezoidal
method. Three approximate solutions, computed with $h = 1/400$, $h = 1/300$,
and $h = 1/200$, are plotted. These step sizes correspond to values of $h\lambda$ of $-2.5$,
$-10/3$, and $-5$, respectively. Since the method is A-stable, the approximate
solution continues to track the asymptotic behavior of the exact solution with these
larger step sizes. (The second-order Taylor method, which is not A-stable, produces
exponentially increasing solutions for any step size larger than $h = 1/500$.) On the
other hand, as noted above, values of $h\lambda$ smaller than $-2$ introduce sawtooth
oscillations into the approximate solution. Furthermore, with decreasing $h\lambda$, there is
an increase in both the amplitude and the duration of the oscillations.
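The sawtooth behavior described in this example can be reproduced with a few lines of code. The sketch below applies the trapezoidal method to $y' = -1000y - e^{-t}$, $y(0) = 0$; because the problem is linear in $y$, equation (6) is solved for $w_{n+1}$ in closed form rather than by Newton's method. Counting sign changes in the first six errors is a diagnostic chosen for this illustration, not one used in the text.

```python
from math import exp

LAM = -1000.0   # fast characteristic exponent of y' = -1000 y - exp(-t)

def exact(t):
    return (exp(LAM * t) - exp(-t)) / 999.0

def trapezoidal_errors(h, steps):
    """Errors w_n - y(t_n), n = 1..steps, for y' = -1000 y - exp(-t), y(0) = 0."""
    w, t, errors = 0.0, 0.0, []
    for _ in range(steps):
        t_next = t + h
        # Equation (6) solved for w_{n+1}; possible because the problem is linear in y.
        w = ((1 + 0.5 * h * LAM) * w
             - 0.5 * h * (exp(-t) + exp(-t_next))) / (1 - 0.5 * h * LAM)
        t = t_next
        errors.append(w - exact(t))
    return errors

def sign_changes(values):
    """Number of consecutive pairs with opposite signs."""
    return sum(1 for a, b in zip(values, values[1:]) if a * b < 0)

for h in (1 / 600, 1 / 200):
    errs = trapezoidal_errors(h, 6)
    print(f"h*lambda = {h * LAM:6.3f}  sign changes among first 6 errors: {sign_changes(errs)}")
```

With $h\lambda = -5/3$ the error keeps one sign, while with $h\lambda = -5$ it alternates at every step, which is the sawtooth pattern.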


Backward Differentiation Formulas 


Among the most widely used numerical techniques for approximating the solution 
of stiff problems are the backward differentiation formulas popularized by Gear [1]. 
The process for constructing backward differentiation formulas is quite straightfor- 
ward. Simply evaluate the differential equation 


$$y'(t) = f(t, y(t))$$

at time level $t = t_{n+1}$, and then replace $y'(t_{n+1})$ by a backward difference
approximation of desired order. Using a first-order backward difference approximation for
the derivative term produces the difference equation

$$\frac{w_{n+1} - w_n}{h} = f(t_{n+1}, w_{n+1}),$$

which is known as the backward Euler method. Like Euler’s method, this is a first-
order one-step method. The basic difference is that the backward Euler method
is implicit, while Euler’s method is explicit. Higher-order backward differentiation
formulas are all implicit multistep methods. The second and third order schemes,
for example, are

$$w_{n+1} = \frac{4}{3} w_n - \frac{1}{3} w_{n-1} + \frac{2}{3} h f(t_{n+1}, w_{n+1})$$

and

$$w_{n+1} = \frac{18}{11} w_n - \frac{9}{11} w_{n-1} + \frac{2}{11} w_{n-2} + \frac{6}{11} h f(t_{n+1}, w_{n+1}),$$

respectively. Although, in principle, backward differentiation formulas to any order 
can be constructed, in practice, only the methods up through order six are useful. 
The methods of order seven and beyond are not convergent. 

What has made the backward differentiation formulas so popular is their 
excellent stability properties. For example, the backward Euler method is A-stable. 
To establish this result, note that for the backward Euler method

$$Q(h\lambda) = \frac{1}{1 - h\lambda}.$$

Therefore,

$$|Q(h\lambda)| = \frac{1}{\sqrt{(1 - \mathrm{Re}\, h\lambda)^2 + (\mathrm{Im}\, h\lambda)^2}}.$$

For any $h\lambda$ with $\mathrm{Re}\, h\lambda < 0$, it follows that $1 - \mathrm{Re}\, h\lambda > 1$. This implies that
$(1 - \mathrm{Re}\, h\lambda)^2 + (\mathrm{Im}\, h\lambda)^2 > 1$ and finally that $|Q(h\lambda)| < 1$. Hence the backward
Euler method is absolutely stable for all $h\lambda$ with $\mathrm{Re}\, h\lambda < 0$, and the method is
A-stable. The second-order backward differentiation formula is also A-stable, but the
proof is much more involved (see Iserles [2]). The methods of order three through
six are not A-stable. In fact, no multistep method of order larger than two can
be A-stable. This is known as the second Dahlquist barrier [3]. The regions of absolute
stability for the backward differentiation formulas of order three through six do,
however, contain the entire negative real line and, compared to the other methods
we have studied, large portions of the left side of the complex plane.
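A quick numerical check of the backward Euler result is sketched below, together with a comparison against the trapezoidal amplification factor $(2 + h\lambda)/(2 - h\lambda)$ at a few real values of $h\lambda$. The sampling grid and the function names are arbitrary choices for this illustration.

```python
def Q_backward_euler(z):
    """Amplification factor of the backward Euler method, z = h*lambda."""
    return 1 / (1 - z)

def Q_trapezoidal(z):
    """Amplification factor of the trapezoidal method, z = h*lambda."""
    return (2 + z) / (2 - z)

# Sample the open left half-plane on a grid (the grid itself is an arbitrary choice).
grid = [complex(-0.1 - i / 4, j / 4) for i in range(41) for j in range(-40, 41)]
print(all(abs(Q_backward_euler(z)) < 1 for z in grid))

# For real h*lambda < -2 the backward Euler factor stays between 0 and 1,
# whereas the trapezoidal factor is negative.
for x in (-2.5, -10 / 3, -5.0, -1000.0):
    print(f"h*lambda = {x:9.3f}   backward Euler Q = {Q_backward_euler(x):.4f}"
          f"   trapezoidal Q = {Q_trapezoidal(x):+.4f}")
```

A positive amplification factor means the transient decays without changing sign from step to step.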


EXAMPLE 7.32 Backward Differentiation Formulas in Action 
Let’s reconsider the initial value problem 

$$y' = -1000y - e^{-t}, \qquad y(0) = 0,$$

whose exact solution is

$$y(t) = \frac{1}{999} \left( e^{-1000t} - e^{-t} \right).$$

Figure 7.25 Solution of $y' = -1000y - e^{-t}$, $y(0) = 0$ using the backward
Euler method. (Top panel) Approximate solution, computed with
$h = 1/600$, versus the exact solution. Solution obtained from trapezoidal
method included for comparison. (Bottom panel) Approximate solutions
(backward Euler method) computed with $h\lambda = -2.5$, $h\lambda = -10/3$, and
$h\lambda = -5$.


The top graph of Figure 7.25 displays the exact solution of the initial value problem 
together with the approximate solutions obtained from the backward Euler method 
and the trapezoidal method. A step size of h = 1/600 was used with each numerical 
method. For this problem at least, we see that the accuracy of the backward Euler 
method compares quite favorably with that of the trapezoidal method, even though 
the backward Euler method is only first-order, whereas the trapezoidal method is
second-order. When larger step sizes are used (in particular $h = 1/400$, $h = 1/300$,
and $h = 1/200$), the bottom graph of Figure 7.25 shows that the backward Euler
method does not introduce spurious oscillations into the approximate solution.

The same is not true for the second-order backward differentiation formula.
Observe in the top graph of Figure 7.26 that the second-order backward differentiation
formula overshoots the minimum value of the exact solution, though it
does more closely follow the initial, decreasing portion than does the trapezoidal
method. This overshoot is caused by the characteristic polynomial associated with
the formula having complex roots when $h = 1/600$. These complex roots introduce
a sinusoidal oscillation into the approximate solution. In fact, with real $\lambda$, the
roots of the characteristic polynomial associated with the second-order backward
differentiation formula will be complex whenever $h\lambda < -1/2$ (see Exercise 7). Since
the magnitude of the roots will always be less than 1 (recall that the method is
A-stable), the amplitude of the oscillation will decay as additional time steps are
computed. In contrast to the trapezoidal method, however, the initial amplitude
of the spurious oscillation decreases and the decay rate increases as $h$ increases, as
demonstrated in the bottom graph of Figure 7.26.

Figure 7.26 Solution of $y' = -1000y - e^{-t}$, $y(0) = 0$ using the second-
order backward differentiation formula (BDF). (Top panel) Approximate
solution, computed with $h = 1/600$, versus the exact solution. Solution
obtained from trapezoidal method included for comparison. (Bottom
panel) Approximate solutions (second-order BDF) computed with
$h\lambda = -2.5$, $h\lambda = -10/3$, and $h\lambda = -5$.
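The statements about the roots can be checked directly. Applying the second-order backward differentiation formula to $y' = \lambda y$ gives the quadratic $(1 - \frac{2}{3} h\lambda) z^2 - \frac{4}{3} z + \frac{1}{3} = 0$; the sketch below, with illustrative function names, evaluates its roots for several real values of $h\lambda$.

```python
import cmath

def bdf2_roots(x):
    """Roots of (1 - 2x/3) z**2 - (4/3) z + 1/3 = 0, where x = h*lambda."""
    c2, c1, c0 = 1 - 2 * x / 3, -4 / 3, 1 / 3
    disc = cmath.sqrt(c1 * c1 - 4 * c2 * c0)
    return (-c1 + disc) / (2 * c2), (-c1 - disc) / (2 * c2)

for x in (-0.25, -5 / 3, -2.5, -10 / 3, -5.0):
    r1, r2 = bdf2_roots(x)
    is_complex = abs(r1.imag) > 1e-12
    print(f"h*lambda = {x:7.3f}   complex pair: {is_complex}"
          f"   largest |root| = {max(abs(r1), abs(r2)):.4f}")
```

The roots are real for $h\lambda = -0.25$ and form a complex pair for the other values, with a modulus that shrinks as $h\lambda$ becomes more negative, consistent with the faster decay noted above.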


One final comment needs to be made before leaving this example. Since 
the second-order backward differentiation formula is a two-step method, a one-step 
method must be used to compute $w_1$. For this problem, $w_1$ was obtained as follows.
First, apply the backward Euler method to advance the solution from $t_0$ to $t_0 + h$
in a single step. Call this value $w^{h}$. Next, apply the backward Euler method to
advance the solution from $t_0$ to $t_0 + h$ in two steps, calling the result $w^{h/2}$. Finally,
extrapolate from these two values to give

$$w_1 = 2w^{h/2} - w^{h}.$$
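The extrapolation step can be sketched as follows for the problem of this example. Because $y' = -1000y - e^{-t}$ is linear in $y$, each backward Euler step is solved in closed form; the function names are illustrative.

```python
from math import exp

LAM = -1000.0

def backward_euler(w, t, h, steps):
    """Backward Euler for y' = LAM*y - exp(-t); the implicit equation is linear,
    so each step can be solved in closed form."""
    for _ in range(steps):
        t += h
        w = (w - h * exp(-t)) / (1 - h * LAM)
    return w

def extrapolated_start(w0, t0, h):
    """Return (w_1, w^h, w^{h/2}) with w_1 = 2*w^{h/2} - w^h."""
    w_h = backward_euler(w0, t0, h, 1)          # one step of size h
    w_half = backward_euler(w0, t0, h / 2, 2)   # two steps of size h/2
    return 2 * w_half - w_h, w_h, w_half

h = 1 / 600
w1, w_h, w_half = extrapolated_start(0.0, 0.0, h)
y1 = (exp(LAM * h) - exp(-h)) / 999.0           # exact solution at t = h

# Errors of the single-step, two-half-step and extrapolated values.
print(abs(w_h - y1), abs(w_half - y1), abs(w1 - y1))
```

The extrapolated value carries the smallest error of the three.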


This procedure, in principle, produces a second-order accurate starting value.


Some Final Thoughts


In this section, the concepts of absolute stability and A-stable numerical methods 
were introduced. Three different A-stable methods were then investigated, and 
the relative merits of each method noted. All of the results derived above were 
based on the examination of a simple linear test problem and a single linear ini- 
tial value problem. Though similar performance can be expected for general linear 
equations and systems of equations, what happens with nonlinear problems? Un- 
fortunately, there is no simple answer to this question, and stability analysis of 
discretized nonlinear ordinary differential equations is currently a very active area 
of research. The major difficulty with nonlinear equations is that the characteristic
exponents, $\lambda$, change from point to point, and, hence, the performance of a given
numerical method can vary significantly as the approximate solution is computed.
For example, in the event that $\lambda$ becomes positive, it is possible for the backward
Euler method to introduce spurious oscillations. This does not mean that the work
put forth in this section has been wasted. On the contrary, the linear model was a 
convenient starting point that served well in illuminating the basic processes which 
give rise to stiff equations. 

Though we shall not undertake such a development here, variable step size al- 
gorithms can be constructed based on backward differentiation formulas. Extensive 
effort has been made to develop computationally efficient implementations of these 
schemes. Substantial research has also gone into the derivation and implementation 
of implicit Runge-Kutta methods: the $k$ values on which these methods are based
(such as the $k_1$, $k_2$, $k_3$, and $k_4$ values which define the classical fourth-order
Runge-Kutta method) appear implicitly. It is known that A-stable implicit Runge-Kutta
methods exist for all orders (Butcher [4]), which makes them suitable for the solution of
stiff problems. Suggested references for further study on stiff problems include
Gear [1], Shampine and Gear [5], Aiken [6], Shampine and Gordon [7], Dekker and 
Verwer [8], Lambert [9], and Hairer and Wanner [10].


EXERCISES


1. Consider the fifth-order Taylor method.
(a) What is the polynomial $Q(z)$ associated with this method?
(b) For real $\lambda$, to what interval must the value of $h\lambda$ be restricted to maintain
absolute stability?
(c) Plot the region of absolute stability for the fifth-order Taylor method.

2. Repeat Exercise 1 for the sixth-order Taylor method.
3. Consider the four-step Adams-Bashforth method

$$\frac{w_{i+1} - w_i}{h} = \frac{55}{24} f(t_i, w_i) - \frac{59}{24} f(t_{i-1}, w_{i-1}) + \frac{37}{24} f(t_{i-2}, w_{i-2}) - \frac{9}{24} f(t_{i-3}, w_{i-3}).$$

(a) What is the characteristic polynomial, $Q(z, h\lambda)$, associated with this
method?
(b) Plot the region of absolute stability for the four-step Adams-Bashforth
method.


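4. Repeat Exercise 3 for Milne’s method

$$\frac{w_{i+1} - w_{i-3}}{h} = \frac{4}{3} \left[ 2 f(t_i, w_i) - f(t_{i-1}, w_{i-1}) + 2 f(t_{i-2}, w_{i-2}) \right].$$

5. Repeat Exercise 3 for Simpson’s method

$$\frac{w_{i+1} - w_{i-1}}{h} = \frac{1}{3} \left[ f(t_{i+1}, w_{i+1}) + 4 f(t_i, w_i) + f(t_{i-1}, w_{i-1}) \right].$$

6. Repeat Exercise 3 for the leapfrog method

$$\frac{w_{i+1} - w_{i-1}}{h} = 2 f(t_i, w_i).$$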


7. Consider the second-order backward differentiation formula

$$w_{n+1} = \frac{4}{3} w_n - \frac{1}{3} w_{n-1} + \frac{2}{3} h f(t_{n+1}, w_{n+1}).$$

(a) What is the characteristic polynomial, $Q(z, h\lambda)$, associated with this
method?
(b) What are the roots of $Q(z, h\lambda)$?
(c) Suppose that $\lambda$ is real. Show that the roots of $Q(z, h\lambda)$ are complex whenever
$h\lambda < -1/2$.


8. Apply the backward Euler method to approximate the solution of the given initial
value problem over the indicated interval in $t$ and using the indicated number of
time steps. Solve any nonlinear algebraic equations using Newton’s method.

(a) $x' = tx^3 - x$ $(0 \le t \le 1)$, $x(0) = 1$, $N = 4$

(b) $x' + (4/t)x = t^4$ $(1 \le t \le 3)$, $x(1) = 1$, $N = 5$

(c) $x' = (\sin x - e^t)/\cos x$ $(0 \le t \le 1)$, $x(0) = 0$, $N = 3$

(d) $x' = (1 + x^2)/t$ $(1 \le t \le 4)$, $x(1) = 0$, $N = 5$

(e) $x' = t^2 - 2x^2 - 1$ $(0 \le t \le 1)$, $x(0) = 0$, $N = 4$

(f) $x' = 2(1 - x)/(t^2 \sin x)$ $(1 \le t \le 2)$, $x(1) = 2$, $N = 3$

9. Repeat Exercise 8 using the trapezoidal method. 

10. Repeat Exercise 8 using the second-order backward differentiation formula. Use
the trapezoidal method to determine $w_1$.


11. Recall that when $\lambda$ is real and $Q(h\lambda) < 0$, a one-step method will introduce
spurious sawtooth-like oscillations into the approximate solution. Consider Euler’s
method.

(a) What is the polynomial $Q(h\lambda)$ for Euler’s method?

(b) Use the initial value problems

IVP#1: $y' = -1000y - e^{-t}$, $y(0) = 0$

IVP#2: $u_1' = 9u_1 + 24u_2 + 5\cos t - \frac{1}{3}\sin t$, $u_1(0) = 4/3$
       $u_2' = -24u_1 - 51u_2 - 9\cos t + \frac{1}{3}\sin t$, $u_2(0) = 2/3$

to demonstrate the presence of spurious oscillations in the approximate
solution generated by Euler’s method.


The following three initial value problems are for use with Exercises 12-15.

$y' = -200y + 200\sin t + \cos t$, $y(0) = 1$

$u_1' = 9u_1 + 24u_2 + 5\cos t - \frac{1}{3}\sin t$, $u_1(0) = 4/3$
$u_2' = -24u_1 - 51u_2 - 9\cos t + \frac{1}{3}\sin t$, $u_2(0) = 2/3$

$u_1' = -20u_1 - 19u_2$, $u_1(0) = 2$
$u_2' = -19u_1 - 20u_2$, $u_2(0) = 0$


12. For the second-order Runge-Kutta method of your choice and for each of the
three initial value problems listed above,
(a) determine the maximum allowable time step to maintain absolute stability;
(b) compute the approximate solution using a step size which is roughly 20%
smaller than the value found in part (a); and
(c) compute the approximate solution using a step size that is roughly 20%
larger than the value found in part (a).


13. Repeat Exercise 12 with the second-order Adams-Bashforth method. 
14. Repeat Exercise 12 with the classical fourth-order Runge-Kutta method.

15. On the three problems listed before Exercise 12, compare the performance of the
trapezoidal method, the backward Euler method, and the second-order backward
differentiation formula. Choose one value for $h$ such that $h\lambda > -2$ and several
(increasing) values for $h$ such that $h\lambda < -2$.


16. Compare the approximate solutions of the initial value problem

$$y' = 5e^{5t}(y - t)^2 + 1, \qquad y(0) = -1$$

obtained using

(a) the trapezoidal method;
(b) the backward Euler method;
(c) the second-order backward differentiation formula;
(d) the classical fourth-order Runge-Kutta method; and
(e) the RKF45 method.

For the fixed step size methods, use $h = 0.25$, and for the RKF45 method use
an error tolerance of $5 \times 10^{-7}$. Advance each solution out to $t = 1$. The exact
solution of the initial value problem is $y(t) = t - e^{-5t}$.


17. Reconsider the chemostat problem from the beginning of the section:

$$\frac{ds}{dt} = 1 - s - \frac{msx}{a + s}$$

$$\frac{dx}{dt} = \frac{msx}{a + s} - x$$

Take $m = 16$, $a = 0.25$, $s(0) = 0.5$, and $x(0) = 0.02$. Approximate the solution
to this problem, out to $t = 12$, using the trapezoidal method, the backward
Euler method and the second-order backward differentiation formula. Compare
performance for $h = 0.05$, $h = 0.1$, and $h = 0.12$.


18. Consider the system of differential equations

$$\frac{dy_1}{dt} = -0.013 y_1 - 1000 y_1 y_3$$

$$\frac{dy_2}{dt} = -2500 y_2 y_3$$

$$\frac{dy_3}{dt} = -0.013 y_1 - 1000 y_1 y_3 - 2500 y_2 y_3$$

subject to the initial conditions $y_1(0) = 1$, $y_2(0) = 1$, and $y_3(0) = 0$.

(a) Approximate the solution using the RKF45 method with an error tolerance
of $5 \times 10^{-7}$. Advance the solution to $t = 5$. In what range do the majority
of time steps taken during the calculation of the solution fall? Assuming
this range persists forward in time, roughly how many time steps would be
needed to advance the solution to $t = 50$?

(b) Approximate the solution, out to $t = 5$, using the trapezoidal method,
the backward Euler method, and the second-order backward differentiation
formula with a step size of $h = 0.01$. Compare with the results from part
(a).

(c) Approximate the solution, out to $t = 5$, using the trapezoidal method,
the backward Euler method, and the second-order backward differentiation
formula, with a step size of $h = 0.1$. Compare with the results from parts
(a) and (b).

Note: This problem was proposed by Gear (“The Automatic Integration of Stiff
Ordinary Differential Equations,” Proceedings of the IP68 Conference, North-
Holland, Amsterdam, 1969) as a test problem for software to solve stiff ordinary
differential equations (ODEs).


CHAPTER 8 


Two-Point Boundary Value 
Problems 


AN OVERVIEW 
Fundamental Mathematical Problem 


In this chapter, we will develop numerical techniques for approximating the solution 
of the two-point boundary value problem 


$$y'' = f(x, y, y'), \qquad a \le x \le b,$$
$$\alpha_1 y(a) + \alpha_2 y'(a) = \alpha_3,$$
$$\beta_1 y(b) + \beta_2 y'(b) = \beta_3,$$


where $f$ is an arbitrary function of its three arguments. When $f$ is of the form

$$f(x, y, y') = p(x) y' + q(x) y + r(x)$$

for some functions $p$, $q$ and $r$, the boundary value problem is called linear; otherwise,
it is nonlinear. The boundary conditions given above, in which a value is specified
for a linear combination of the unknown function and its first derivative, are called
Robin (or mixed) boundary conditions. The special case in which the value of the
unknown function is specified [e.g., $y(a) = \alpha$ and/or $y(b) = \beta$] is called a Dirichlet
boundary condition, while the special case in which the value of the first derivative is
specified [e.g., $y'(a) = \alpha$ and/or $y'(b) = \beta$] is called a Neumann boundary condition.
More general boundary conditions, such as periodic conditions, can be specified, 
but we will not consider them here. 


Steady-State Temperature Distribution in a Pin Fin 


In heat transfer, the term extended surface is used in reference to a solid that 
experiences energy transfer by conduction within its boundaries and by convection 
between its boundaries and the surroundings. The most common application of 
an extended surface is to enhance the heat transfer rate between a solid and an 
adjoining fluid (such as air or water). Extended surfaces used in this manner are 
called fins. A fin that has a circular cross section is called a pin fin. 

Suppose we have a pin fin of length $L$ and varying radius $r(x)$ attached to a
surface, as shown in Figure 8.1(a). We would like to determine both the temperature
distribution along the length of the fin, $T(x)$, and the total fin heat transfer rate, $q_f$.


Figure 8.1 (a) A pin fin with nonuniform cross section. (b) Energy 
balance on an arbitrary slice of a pin fin that experiences convective heat 
loss from its lateral surface. 


Performing an energy balance on the arbitrary slice of the fin shown in Figure 8.1(b)
gives

$$q_x = q_x + \frac{dq_x}{dx} \Delta x + q_{conv}, \qquad (1)$$

where $q_x$ is the rate of heat transfer along the length of the fin due to conduction
and $q_{conv}$ is the rate of heat transfer from the lateral surface of the fin due to
convection. According to Fourier’s law

$$q_x = -kA \frac{dT}{dx}, \qquad (2)$$

where $k$ is the thermal conductivity of the fin and $A$ is the area through which
the heat flows. The minus sign indicates that heat flows from regions of high
temperature to regions of low temperature. The convection heat transfer rate is
given by

$$q_{conv} = h A_s (T - T_\infty). \qquad (3)$$

Here, $h$ is the convection heat transfer coefficient, $A_s$ is the area of the surface from
which convection takes place and $T_\infty$ is the temperature of the adjoining fluid.
Substituting (2) and (3) into (1), taking into account the geometry of the fin
that implies that $A = \pi [r(x)]^2$ and $A_s = 2\pi r(x) \Delta x$, and simplifying the resulting
equation yields

$$\frac{d}{dx} \left( [r(x)]^2 \frac{dT}{dx} \right) - \frac{2h}{k} r(x) (T - T_\infty) = 0. \qquad (4)$$
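For readers who want the intermediate algebra, the simplification leading to (4) can be written out as follows. This is a sketch that assumes the thermal conductivity $k$ is constant along the fin, which the form of (4) suggests but the text does not state explicitly.

```latex
\begin{align*}
0 &= \frac{dq_x}{dx}\,\Delta x + q_{conv}
  && \text{from (1), after cancelling } q_x \\
  &= -k\pi \frac{d}{dx}\!\left( [r(x)]^2 \frac{dT}{dx} \right) \Delta x
     + 2\pi h\, r(x)\, \Delta x\, (T - T_\infty)
  && \text{using (2), (3), } A = \pi [r(x)]^2,\ A_s = 2\pi r(x)\,\Delta x \\
\Longrightarrow\quad
0 &= \frac{d}{dx}\!\left( [r(x)]^2 \frac{dT}{dx} \right)
     - \frac{2h}{k}\, r(x)\, (T - T_\infty)
  && \text{dividing by } -k\pi\,\Delta x
\end{align*}
```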


At $x = 0$, where the fin meets the solid, the temperature of the fin equals that of
the solid, $T_0$. Thus, we have the boundary condition

$$T(0) = T_0. \qquad (5)$$

At the tip of the fin, the heat conducted to the tip must balance the heat convected
away from the tip, so we have the boundary condition

$$-k T'(L) = h \left( T(L) - T_\infty \right). \qquad (6)$$

The boundary value problem given by (4), (5) and (6) determines $T(x)$. Once $T(x)$
is known, the total fin heat transfer rate is given by

$$q_f = -k\pi [r(0)]^2 T'(0). \qquad (7)$$


Figure 8.2 Geometry and control volume for insect dispersal problem.


Spatial Distribution of an Insect Population 


Suppose a population of insects is placed into a closed environment in the shape of 
a circular cylinder of length L (see Figure 8.2). We would like to model the spread 
of the population throughout the environment, with the objective of determining 
the eventual steady-state distribution of the insects. For simplicity, we will assume 
the insects disperse along the axis of the cylinder only. At X = 0, such harsh 
environmental conditions are maintained that the insects cannot survive, while a 
barrier prevents the insects from migrating beyond X = L. 

Let $N(X)$ denote the steady-state population density of the insects, measured
in insects per unit volume. Consider the arbitrary slice of the cylinder which is
highlighted in Figure 8.2. This slice is called a control volume. The flux function
$J(X)$ measures the number of insects that cross location $X$ in the positive direction
per unit area per unit time. The function $F(N, X)$ describes the net birth rate of
insects per unit volume. Finally, $A$ denotes the cross-sectional area of the cylinder.
At steady state, the number of insects within the control volume must remain
constant. This requires that the conservation law

$$\begin{pmatrix}\text{the rate at which}\\ \text{insects enter}\\ \text{the control volume}\end{pmatrix} - \begin{pmatrix}\text{the rate at which}\\ \text{insects leave}\\ \text{the control volume}\end{pmatrix} + \begin{pmatrix}\text{the net rate at which}\\ \text{insects are born within}\\ \text{the control volume}\end{pmatrix} = 0$$

hold. Using the information from Figure 8.2, the conservation law becomes
$$[J(X) - J(X + \Delta X)]A + F(N, X)A\,\Delta X = 0.$$
Dividing this equation by $A\,\Delta X$ and taking the limit as $\Delta X \to 0$ yields
$$-\frac{dJ}{dX} + F(N, X) = 0. \tag{8}$$


To proceed further, we relate the flux to the population density by assuming
Fick's law, which states that $J$ is proportional to the gradient of $N$. Thus,

$$J = -D\,\frac{dN}{dX}. \tag{9}$$


The coefficient of diffusion D measures the efficiency with which the insects disperse, 
and the minus sign indicates that migration takes place from regions of high density 
to regions of low density. For the source term we will use the logistic growth law; 
that is, 


F(N,X) =rN(1- N/K), (10) 


where $r$ is the reproductive rate of the insects and $K$ is the environmental carrying
capacity. Substituting (9) and (10) into (8) and assuming that $D$ is constant yields

$$D\,\frac{d^2N}{dX^2} + rN\left(1 - \frac{N}{K}\right) = 0. \tag{11}$$

The boundary conditions associated with (11) are
$$N(0) = 0 \quad\text{and}\quad \frac{dN}{dX}(L) = 0.$$

If we now introduce the nondimensional variables $n = N/K$ and $x = X\sqrt{r/D}$, we
arrive at the boundary value problem

$$n'' + n(1 - n) = 0, \quad n(0) = 0, \quad n'(\ell) = 0, \tag{12}$$

where primes indicate differentiation with respect to $x$ and $\ell = L\sqrt{r/D}$.
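
The change of variables can be checked in a few lines. The following is a sketch of the chain-rule computation, assuming the dimensional problem is $D\,N'' + rN(1 - N/K) = 0$ with $N(0) = 0$ and $N'(L) = 0$, as derived above; no symbols beyond those already in the text are introduced.

```latex
% With N = K n and x = X \sqrt{r/D}, the chain rule gives
\frac{dN}{dX} = K\sqrt{\frac{r}{D}}\,\frac{dn}{dx},
\qquad
\frac{d^2N}{dX^2} = K\,\frac{r}{D}\,\frac{d^2n}{dx^2}.
% Substituting into D N'' + r N (1 - N/K) = 0:
D \cdot K\,\frac{r}{D}\,n'' + rKn\,(1 - n) = 0
\quad\Longrightarrow\quad
rK\left[\,n'' + n(1 - n)\,\right] = 0
\quad\Longrightarrow\quad
n'' + n(1 - n) = 0.
% The boundary X = L maps to x = L\sqrt{r/D} = \ell, so
N(0) = 0 \;\Rightarrow\; n(0) = 0,
\qquad
\frac{dN}{dX}(L) = 0 \;\Rightarrow\; n'(\ell) = 0.
```

Note that the three parameters $D$, $r$, and $K$ collapse into the single parameter $\ell$, which is the only quantity the nondimensional problem depends on.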
The Remainder of the Chapter


In this chapter, we will develop two techniques for the numerical solution of two- 
point boundary value problems: the finite difference method and the shooting 
method. We will start with the finite difference method. In Section 8.1, the linear 
boundary value problem with Dirichlet boundary conditions is considered. Non- 
Dirichlet boundary conditions (Neumann and Robin conditions) are considered in 
Section 8.2. The solution of nonlinear boundary value problems is discussed in
Section 8.3. The final two sections will present the shooting method, covering linear
problems and nonlinear problems, in that order.


An Artificial Singularity 


When a partial differential equation is reduced to an ordinary differential equation 
using symmetry considerations, an artificial singularity often appears in one or more 
of the coefficients of the ordinary differential equation. The singularity is artificial 
in the sense that the exact solution is smooth near the coefficient singularity. For 
example, the boundary value problem

$$u'' + \frac{1}{x}\,u' = \left(\frac{8}{8 - x^2}\right)^2, \qquad u'(0) = u(1) = 0$$

has a coefficient singularity at $x = 0$, but the exact solution

$$u(x) = 2\ln\left(\frac{7}{8 - x^2}\right)$$

is smooth at $x = 0$. The handling of artificial singularities will be treated where
appropriate throughout the chapter.


8.1 FINITE DIFFERENCE METHOD, PART I: 
THE LINEAR PROBLEM WITH DIRICHLET BOUNDARY CONDITIONS 


Rather than jumping straight into a treatment of the general second-order
one-dimensional two-point boundary value problem

$$\begin{aligned} y'' &= f(x, y, y'), \quad a \le x \le b,\\ \alpha_1 y(a) + \alpha_2 y'(a) &= \alpha_3,\\ \beta_1 y(b) + \beta_2 y'(b) &= \beta_3, \end{aligned}$$

we will begin our study of finite difference methods by investigating the linear
problem
$$y'' = p(x)y' + q(x)y + r(x), \quad x \in [a, b], \tag{1}$$


subject to the Dirichlet boundary conditions

$$y(a) = \alpha, \quad y(b) = \beta.$$

As with the initial value problem techniques we developed in Chapter 7, the objective
of a finite difference method is to approximate the value of the exact solution
to the boundary value problem, $y(x)$, at a discrete set of points $x_0, x_1, x_2, \ldots,
x_N \in [a,b]$. Throughout our discussion, we will let $y_i$ denote the value of the exact
solution at $x = x_i$, and we will denote our finite difference approximation to $y_i$ by
$w_i$. This finite difference approximation is obtained by replacing each derivative
which appears in the boundary value problem with an appropriate finite difference
formula (see Chapter 6). This converts the single continuous ordinary differential
equation for $y(x)$ into a system of discrete algebraic equations for the values $w_0$,
$w_1, w_2, \ldots, w_N$.


The Computational Grid 


To begin the approximation process, we must first introduce a partition, or
computational grid, over the interval $[a,b]$. Let $N$ be a positive integer that denotes the
number of subintervals in the partition and denote the partition itself by

$$a = x_0 < x_1 < x_2 < \cdots < x_{N-1} < x_N = b.$$

The points $x_0$ and $x_N$ are called boundary grid points, while the points $x_1, x_2,
x_3, \ldots, x_{N-1}$ are called interior grid points. For simplicity, a uniform grid will
be assumed. That is, we will assume that $x_i = a + ih$, where $h = (b - a)/N$.
The parameter $h$, known as the step size or the mesh size, is the key parameter
governing the accuracy of the finite difference approximation.


The Finite Difference Approximation 
Once the computational grid has been established, we evaluate the differential
equation, that is, equation (1), at each interior grid point

$$\left\{ y'' = p(x)y' + q(x)y + r(x) \right\}\Big|_{x = x_i} \qquad (1 \le i \le N-1)$$

and then replace all of the derivatives by second-order central finite difference
approximations:

$$\frac{y_{i+1} - 2y_i + y_{i-1}}{h^2} + O(h^2) = p_i\,\frac{y_{i+1} - y_{i-1}}{2h} + q_i y_i + r_i + O(h^2).$$

To simplify notation, we have used $p_i$, $q_i$, and $r_i$ to denote the coefficient function
values $p(x_i)$, $q(x_i)$ and $r(x_i)$, respectively. Next, we drop the truncation error terms
and replace all of the $y$'s (exact solution) by $w$'s (approximate solution). Thus, for
$1 \le i \le N-1$,

$$\frac{w_{i+1} - 2w_i + w_{i-1}}{h^2} = p_i\,\frac{w_{i+1} - w_{i-1}}{2h} + q_i w_i + r_i. \tag{2}$$

To completely determine the unknowns $w_0, w_1, w_2, \ldots, w_N$, two more equations
are needed. These will come from the boundary conditions. At the boundary grid
points, the values $y(x_0) = \alpha$ and $y(x_N) = \beta$ have been specified; therefore, we will
set $w_0 = \alpha$ and $w_N = \beta$. Combining these equations with equation (2) constitutes
the finite difference method. Since the truncation error terms that were dropped
to form equation (2) were $O(h^2)$, the resulting numerical method will be second
order. This point will be demonstrated below.


Second-Order Finite Difference Method 


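$$\begin{aligned} w_0 &= \alpha\\ \frac{w_{i+1} - 2w_i + w_{i-1}}{h^2} &= p_i\,\frac{w_{i+1} - w_{i-1}}{2h} + q_i w_i + r_i, \quad 1 \le i \le N-1\\ w_N &= \beta \end{aligned}$$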


Matrix Formulation 


Since the equations of our numerical method are linear in the unknowns, we will
express the system in matrix form. With $w_0$ and $w_N$ known from the Dirichlet
boundary conditions, there are $N - 1$ unknowns to be determined. To construct the
coefficient matrix and the right-hand-side vector for the system, start by multiplying
equation (2) through by $-h^2$ (to avoid division by a small number; the negative
sign is included so as to obtain a positive coefficient on $w_i$) and then collect terms.
For $i = 1, 2, 3, \ldots, N - 1$, we therefore have

$$\left(-1 - \frac{h}{2}p_i\right)w_{i-1} + (2 + h^2 q_i)\,w_i + \left(-1 + \frac{h}{2}p_i\right)w_{i+1} = -h^2 r_i. \tag{3}$$
We can think of this equation as a computational template or stencil, which is
to be applied at each grid point where the value of the approximate solution is
unknown. From equation (3), we recognize that each row of the coefficient matrix
will have only three nonzero entries: the entry along the main diagonal and the
entries one position to the right and left of the main diagonal. Hence, the coefficient
matrix will be tridiagonal. The first row (corresponding to $i = 1$) and the last row
(corresponding to $i = N - 1$) of the coefficient matrix and the right-hand-side vector
require some care in writing out because $w_0$ and $w_N$ are not unknown. For $i = 1$,
equation (3) gives

$$\left(-1 - \frac{h}{2}p_1\right)w_0 + (2 + h^2 q_1)\,w_1 + \left(-1 + \frac{h}{2}p_1\right)w_2 = -h^2 r_1.$$

Substituting $w_0 = \alpha$ and transposing the first term to the right-hand side, we arrive
at

$$(2 + h^2 q_1)\,w_1 + \left(-1 + \frac{h}{2}p_1\right)w_2 = -h^2 r_1 + \left(1 + \frac{h}{2}p_1\right)\alpha$$


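as the first equation in the system. Working in a similar manner for $i = N - 1$, we
find the last equation to be

$$\left(-1 - \frac{h}{2}p_{N-1}\right)w_{N-2} + (2 + h^2 q_{N-1})\,w_{N-1} = -h^2 r_{N-1} + \left(1 - \frac{h}{2}p_{N-1}\right)\beta.$$
The system of equations for $w_1, w_2, w_3, \ldots, w_{N-1}$ can therefore be written in the
form $A\mathbf{w} = \mathbf{b}$, where

$$A = \begin{bmatrix} d_1 & u_1 \\ l_2 & d_2 & u_2 \\ & l_3 & d_3 & u_3 \\ & & \ddots & \ddots & \ddots \\ & & & l_{N-2} & d_{N-2} & u_{N-2} \\ & & & & l_{N-1} & d_{N-1} \end{bmatrix}$$
with
$$d_i = 2 + h^2 q_i, \tag{4}$$
$$u_i = -1 + \frac{h}{2}p_i, \tag{5}$$
$$l_i = -1 - \frac{h}{2}p_i. \tag{6}$$
The vectors $\mathbf{w}$ and $\mathbf{b}$ are given by
$$\mathbf{w} = \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_{N-2} \\ w_{N-1} \end{bmatrix} \quad\text{and}\quad \mathbf{b} = \begin{bmatrix} -h^2 r_1 + \left(1 + \frac{h}{2}p_1\right)\alpha \\ -h^2 r_2 \\ \vdots \\ -h^2 r_{N-2} \\ -h^2 r_{N-1} + \left(1 - \frac{h}{2}p_{N-1}\right)\beta \end{bmatrix}.$$

Note how the boundary conditions have been incorporated into the first and last
entries of the vector $\mathbf{b}$.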
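
The matrix formulation translates almost line for line into code. The following is a minimal sketch, assuming NumPy is available and that `p`, `q`, and `r` are Python callables; the function names `thomas_solve` and `solve_dirichlet_bvp` are illustrative choices, not names from the text. The tridiagonal solve is done without pivoting, which relies on the diagonal dominance discussed in the next subsection.

```python
import numpy as np

def thomas_solve(lower, diag, upper, rhs):
    """Solve a tridiagonal system by elimination and back substitution.

    lower[i] multiplies x[i-1] in row i (lower[0] is unused);
    upper[i] multiplies x[i+1] in row i (upper[-1] is unused).
    No pivoting is performed, so diagonal dominance is assumed.
    """
    n = len(diag)
    d = np.array(diag, dtype=float)
    b = np.array(rhs, dtype=float)
    for i in range(1, n):
        m = lower[i] / d[i - 1]
        d[i] -= m * upper[i - 1]
        b[i] -= m * b[i - 1]
    x = np.empty(n)
    x[-1] = b[-1] / d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = (b[i] - upper[i] * x[i + 1]) / d[i]
    return x

def solve_dirichlet_bvp(p, q, r, a, b, alpha, beta, N):
    """Approximate y'' = p y' + q y + r, y(a) = alpha, y(b) = beta."""
    h = (b - a) / N
    x = a + h * np.arange(N + 1)
    xi = x[1:-1]                       # interior grid points x_1 .. x_{N-1}
    pv = p(xi) + 0.0 * xi              # "+ 0.0 * xi" broadcasts scalar returns
    qv = q(xi) + 0.0 * xi
    rv = r(xi) + 0.0 * xi
    d = 2.0 + h**2 * qv                # equation (4)
    u = -1.0 + 0.5 * h * pv            # equation (5)
    l = -1.0 - 0.5 * h * pv            # equation (6)
    rhs = -h**2 * rv
    rhs[0] -= l[0] * alpha             # adds (1 + h p_1 / 2) * alpha
    rhs[-1] -= u[-1] * beta            # adds (1 - h p_{N-1} / 2) * beta
    w = np.empty(N + 1)
    w[0], w[-1] = alpha, beta
    w[1:-1] = thomas_solve(l, d, u, rhs)
    return x, w
```

For instance, the problem $-u'' + \pi^2 u = 2\pi^2\sin(\pi x)$, $u(0) = u(1) = 0$ corresponds to `solve_dirichlet_bvp(lambda x: 0.0, lambda x: np.pi**2, lambda x: -2 * np.pi**2 * np.sin(np.pi * x), 0.0, 1.0, 0.0, 0.0, 4)`.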


Solvability of Discrete Equations 


Does this system of algebraic equations have a unique solution? Let's suppose that
$p$ is continuous and that $q(x) \ge 0$ on $[a,b]$. By the Extreme Value Theorem, the
continuity of $p$ over the closed interval $[a,b]$ guarantees the existence of a positive
constant $L$ such that $|p(x)| \le L$ on $[a,b]$. If the step size $h$ is chosen to be smaller
than $2/L$, then for each $i$, $-1 < hp_i/2 < 1$, which in turn implies that the terms

$$-1 - \frac{h}{2}p_i \quad\text{and}\quad -1 + \frac{h}{2}p_i$$
are always negative. Therefore,

$$\left|-1 - \frac{h}{2}p_i\right| = 1 + \frac{h}{2}p_i, \qquad \left|-1 + \frac{h}{2}p_i\right| = 1 - \frac{h}{2}p_i,$$

so that along rows 2 through $N - 2$ of the matrix $A$,

$$\left|-1 - \frac{h}{2}p_i\right| + \left|-1 + \frac{h}{2}p_i\right| = 2 \le \left|2 + h^2 q_i\right|;$$

that is, in rows 2 through $N - 2$ the sum of the absolute values of the off-diagonal
entries is less than or equal to the absolute value of the diagonal entry. The first
and last rows satisfy an even stronger condition: the sum of the absolute values
of the off-diagonal entries is strictly less than the absolute value of the diagonal
element. Hence, $A$ is diagonally dominant, with its first and last rows being strictly
diagonally dominant. These conditions are sufficient to guarantee a unique solution
to our finite difference equations (see Isaacson and Keller [1]), provided the step
size, $h$, is selected smaller than $2/L$.
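
The solvability condition is easy to test numerically before solving. The sketch below assumes NumPy, callables `p` and `q`, and at least three subintervals; the names `max_step_size` and `is_diagonally_dominant` are hypothetical. Note that `max_step_size` only estimates the bound $L$ by sampling $|p|$ on a grid, so it is a heuristic rather than a rigorous bound.

```python
import numpy as np

def max_step_size(p, a, b, samples=2001):
    """Estimate 2/L, where L is the sampled maximum of |p| on [a, b]."""
    xs = np.linspace(a, b, samples)
    L = np.max(np.abs(p(xs) + 0.0 * xs))
    return np.inf if L == 0.0 else 2.0 / L

def is_diagonally_dominant(p, q, a, b, N):
    """Check row-wise diagonal dominance of the (N-1) x (N-1) matrix A (N >= 3)."""
    h = (b - a) / N
    xi = a + h * np.arange(1, N)        # interior grid points
    pv = p(xi) + 0.0 * xi
    qv = q(xi) + 0.0 * xi
    d = 2.0 + h**2 * qv
    u = -1.0 + 0.5 * h * pv
    l = -1.0 - 0.5 * h * pv
    off = np.abs(l) + np.abs(u)
    off[0] = abs(u[0])                  # the first row has no l entry
    off[-1] = abs(l[-1])                # the last row has no u entry
    # A small tolerance absorbs rounding in the borderline case |l| + |u| = 2.
    return bool(np.all(off <= np.abs(d) + 1e-12))
```

With $p(x) = 40$ and $q(x) = 0$ on $[0, 1]$, for example, the bound is $h < 0.05$: the check fails for $N = 4$ and passes for $N = 50$.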


Alternative Matrix Formulation 


When we deal with Neumann and Robin boundary conditions in Section 8.2, it will
be easier to generalize the matrix formulation of the finite difference method if we
include $w_0$ and $w_N$ in the vector of unknowns; that is, if we formulate the matrix
equation based on all $N + 1$ equations

$$\begin{aligned} w_0 &= \alpha\\ \left(-1 - \frac{h}{2}p_i\right)w_{i-1} + (2 + h^2 q_i)\,w_i + \left(-1 + \frac{h}{2}p_i\right)w_{i+1} &= -h^2 r_i, \quad 1 \le i \le N-1\\ w_N &= \beta. \end{aligned}$$

Taking this approach, the system of finite difference equations can be written in
the form $A\mathbf{w} = \mathbf{b}$, where $A$ is the $(N+1) \times (N+1)$ tridiagonal matrix

$$A = \begin{bmatrix} 1 & 0 \\ l_1 & d_1 & u_1 \\ & l_2 & d_2 & u_2 \\ & & \ddots & \ddots & \ddots \\ & & & l_{N-2} & d_{N-2} & u_{N-2} \\ & & & & l_{N-1} & d_{N-1} & u_{N-1} \\ & & & & & 0 & 1 \end{bmatrix}$$
and the vectors $\mathbf{w}$ and $\mathbf{b}$ are given by

$$\mathbf{w} = \begin{bmatrix} w_0 \\ w_1 \\ w_2 \\ \vdots \\ w_{N-2} \\ w_{N-1} \\ w_N \end{bmatrix} \quad\text{and}\quad \mathbf{b} = \begin{bmatrix} \alpha \\ -h^2 r_1 \\ -h^2 r_2 \\ \vdots \\ -h^2 r_{N-2} \\ -h^2 r_{N-1} \\ \beta \end{bmatrix}.$$

Recall that $d_i$, $u_i$, and $l_i$ were defined in equations (4), (5) and (6). Following the
same analysis given above, this newly formulated matrix $A$ is diagonally dominant
with strictly diagonally dominant first and last rows and, hence, is guaranteed to
be nonsingular when the step size is selected to satisfy $h < 2/L$.


Worked Examples 


EXAMPLE 8.1 Demonstration Problem 1 


Consider the boundary value problem

$$-u'' + \pi^2 u = 2\pi^2\sin(\pi x)$$
$$u(0) = u(1) = 0.$$

Comparing this problem with the prototype problem given by equation (1), we
see that $p(x) = 0$, $q(x) = \pi^2$ and $r(x) = -2\pi^2\sin(\pi x)$. Since $p(x) = 0$, we are
guaranteed of a unique solution to the finite difference equations for any value of $h$.

Let's start with a uniform partition of the interval $[0,1]$ containing $N = 4$
subintervals. Then $h = 1/4$ and the grid points are given by $x_i = a + ih =
0 + i(1/4) = i/4$. This partition is illustrated in Figure 8.3.

Evaluate the differential equation at the interior grid point $x = x_i$ and then
replace the second derivative by its second-order central finite difference
approximation. This produces the equation

$$\frac{-u_{i-1} + 2u_i - u_{i+1}}{(1/4)^2} + O(h^2) + \pi^2 u_i = 2\pi^2\sin(i\pi/4).$$

Next, drop the truncation error term and replace each $u_i$, which is a value of
the exact solution, by $w_i$, which is a value of the approximate solution. Multiply
both sides of the resulting expression by $(1/4)^2$ and group like terms to yield the
computational template


$$-w_{i-1} + \left[2 + (\pi/4)^2\right]w_i - w_{i+1} = 2(\pi/4)^2\sin(i\pi/4).$$

Figure 8.3 Figure for Example 8.1. (Grid points $x_0 = 0$, $x_1 = 0.25$, $x_2 = 0.5$,
$x_3 = 0.75$, $x_4 = 1$; the Dirichlet boundary conditions are applied at $x_0$ and $x_4$,
and the computational template is applied at $x_1$, $x_2$, and $x_3$.)


Writing out the equations that correspond to $i = 1$, $i = 2$, and $i = 3$, and combining
these with the equations $w_0 = 0$ and $w_4 = 0$, which are derived from the Dirichlet
boundary conditions, the complete system of finite difference equations is given by

$$\begin{bmatrix} 1 & 0 \\ -1 & 2 + (\pi/4)^2 & -1 \\ & -1 & 2 + (\pi/4)^2 & -1 \\ & & -1 & 2 + (\pi/4)^2 & -1 \\ & & & 0 & 1 \end{bmatrix} \begin{bmatrix} w_0 \\ w_1 \\ w_2 \\ w_3 \\ w_4 \end{bmatrix} = \begin{bmatrix} 0 \\ \sqrt{2}\,(\pi/4)^2 \\ 2(\pi/4)^2 \\ \sqrt{2}\,(\pi/4)^2 \\ 0 \end{bmatrix}.$$

The solution of this tridiagonal linear system is
$$\mathbf{w} = [\,0 \quad 0.725371 \quad 1.025830 \quad 0.725371 \quad 0\,]^T.$$


The following table compares this approximate solution with the exact solution:
$u(x) = \sin(\pi x)$. The accuracy of the approximate solution is quite reasonable given
the crudeness of the computational grid.

            Approximate      Exact
   x_i      Solution, w_i    Solution, u_i    Absolute Error

   0.00     0.000000         0.000000
   0.25     0.725371         0.707107         0.018264
   0.50     1.025830         1.000000         0.025830
   0.75     0.725371         0.707107         0.018264
   1.00     0.000000         0.000000


A numerical verification of the second-order accuracy of the scheme is presented
in the next table, which displays both the maximum absolute error and the root
mean square (rms) error in the approximate solution as a function of the number
of subintervals, $N$. The rms error was computed according to the formula

$$\text{rms error} = \sqrt{\frac{1}{N}\sum_{i=1}^{N-1}(w_i - u_i)^2}.$$

Note that with each doubling of $N$, the step size is cut in half, and the approximation
error is reduced by roughly a factor of 4, which is what one would expect from a
second order numerical method. The finite difference approximation obtained using
$N = 32$ is shown in Figure 8.4 (denoted by the diamonds), superimposed on the
exact solution.

Figure 8.4 Comparison between the exact solution to the boundary
value problem $-u'' + \pi^2 u = 2\pi^2\sin(\pi x)$, $u(0) = u(1) = 0$, and the finite
difference approximation generated with a uniform partition containing
32 subintervals.


     N    Maximum Absolute Error    Error Ratio    rms Error       Error Ratio

     4    0.0258297765                             0.0182644101
     8    0.0064337127              4.014754       0.0045493219    4.014754
    16    0.0016068959              4.003814       0.0011362470    4.003814
    32    0.0004016275              4.000961       0.0002839935    4.000961
    64    0.0001004008              4.000241       0.0000709941    4.000241
   128    0.0000250998              4.000060       0.0000177483    4.000060
   256    0.0000062749              4.000015       0.0000044370    4.000015
   512    0.0000015687              4.000000       0.0000011093    4.000000
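
A convergence study of this kind takes only a few lines. The following is a sketch, assuming NumPy and a dense linear solve (adequate for the small values of $N$ used here); the helper names `fd_solve` and `errors` are illustrative. It applies the template $-w_{i-1} + [2 + (\pi h)^2]w_i - w_{i+1} = 2(\pi h)^2\sin(\pi x_i)$ to the problem of Example 8.1 and prints the maximum and rms errors together with the ratio of successive maximum errors.

```python
import numpy as np

def fd_solve(N):
    """Finite difference solution of -u'' + pi^2 u = 2 pi^2 sin(pi x), u(0) = u(1) = 0."""
    h = 1.0 / N
    x = h * np.arange(N + 1)
    n = N - 1                                   # number of interior unknowns
    A = (2.0 + (np.pi * h)**2) * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    rhs = 2.0 * (np.pi * h)**2 * np.sin(np.pi * x[1:-1])
    w = np.zeros(N + 1)                         # boundary values stay at zero
    w[1:-1] = np.linalg.solve(A, rhs)
    return x, w

def errors(N):
    """Return (maximum absolute error, rms error) against u(x) = sin(pi x)."""
    x, w = fd_solve(N)
    e = w - np.sin(np.pi * x)
    return np.max(np.abs(e)), np.sqrt(np.sum(e**2) / N)

previous = None
for N in [4, 8, 16, 32, 64]:
    emax, erms = errors(N)
    ratio = "" if previous is None else f"{previous / emax:.6f}"
    print(f"{N:4d}  {emax:.10f}  {ratio:>9s}  {erms:.10f}")
    previous = emax
```

Ratios that settle near 4 as $N$ doubles are the numerical signature of a second-order method; a ratio near 2 would indicate first-order behavior.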
EXAMPLE 8.2 Demonstration Problem 2


As a second example, consider the boundary value problem

$$u'' = -(x + 1)u' + 2u + (1 - x^2)e^{-x}$$
$$u(0) = -1, \quad u(1) = 0.$$

For this problem, we see that $p(x) = -(x + 1)$, whose maximum absolute value on
the interval $[0,1]$ is 2; hence, we are guaranteed of a unique solution to the finite
difference equations for any value of $h$ less than 1.

Let's use the same partition as in the previous example; that is, divide the
interval $[0,1]$ into $N = 4$ equal subintervals, so that $h = 1/4$ and $x_i = i/4$. Applying
the computational template

$$\left(-1 - \frac{h}{2}p_i\right)w_{i-1} + (2 + h^2 q_i)\,w_i + \left(-1 + \frac{h}{2}p_i\right)w_{i+1} = -h^2 r_i,$$
for $i = 1$, 2, and 3, where
$$p_i = -(x_i + 1) = -(1 + i/4); \qquad q_i = 2; \qquad\text{and}\qquad r_i = (1 - x_i^2)e^{-x_i} = \left[1 - (i/4)^2\right]e^{-i/4},$$

and including the equations $w_0 = -1$ and $w_4 = 0$ obtained from the boundary
conditions, we arrive at the system of finite difference equations

$$\begin{bmatrix} 1 & 0 \\ -\frac{27}{32} & \frac{17}{8} & -\frac{37}{32} \\ & -\frac{13}{16} & \frac{17}{8} & -\frac{19}{16} \\ & & -\frac{25}{32} & \frac{17}{8} & -\frac{39}{32} \\ & & & 0 & 1 \end{bmatrix} \begin{bmatrix} w_0 \\ w_1 \\ w_2 \\ w_3 \\ w_4 \end{bmatrix} = \begin{bmatrix} -1 \\ -\frac{15}{256}e^{-1/4} \\ -\frac{3}{64}e^{-1/2} \\ -\frac{7}{256}e^{-3/4} \\ 0 \end{bmatrix}.$$

The solution of these equations is
$$\mathbf{w} = [\,-1 \quad -0.582559 \quad -0.301452 \quad -0.116906 \quad 0\,]^T,$$

which compares favorably with the exact solution $u(x) = (x - 1)e^{-x}$, as shown in
the following table. A numerical verification of the second-order accuracy of the
finite difference scheme is left as an exercise.


            Approximate      Exact
   x_i      Solution, w_i    Solution, u_i    Absolute Error

   0.00     -1.000000        -1.000000
   0.25     -0.582559        -0.584101        0.001542
   0.50     -0.301452        -0.303265        0.001813
   0.75     -0.116906        -0.118092        0.001186
   1.00      0.000000         0.000000
Figure 8.5 Comparison between the exact solution to the boundary
value problem $u'' = -(x + 1)u' + 2u + (1 - x^2)e^{-x}$, $u(0) = -1$, $u(1) = 0$
and the finite difference approximation generated with a uniform partition
containing 32 subintervals.


The approximate solution obtained using a uniform partition with $N = 32$
subintervals is shown in Figure 8.5 (denoted by the diamonds), superimposed on the
exact solution.


An Application Problem: Flow between Parallel Plates 


In Chapter 6 (see page 515), we investigated the flow of a viscous fluid filling a gap
between two large parallel plates. One plate was stationary, while the other moved
with constant velocity, and there was a linear temperature gradient between the
plates. The velocity distribution established within the fluid, $U(Y)$, was found to
satisfy the boundary value problem

$$\frac{d}{dY}\left(\mu(Y)\,\frac{dU}{dY}\right) = 0, \quad U(0) = 0, \quad U(h) = U_0. \tag{7}$$

Here, $\mu(Y)$ denotes the viscosity of the fluid, $h$ the separation between the plates,
and $U_0$ the velocity of the moving plate. The solution of (7) was expressed in terms
of two definite integrals that had to be calculated numerically.
Figure 8.6 Velocity distribution established in viscous fluid filling gap
between two large parallel plates.


Here, we will determine the velocity distribution by applying the finite difference
method. As in Chapter 6, we take the fluid to be water and use

$$\mu(Y) = \exp\left\{-8.944 - \frac{839.456}{293.16 + 80Y/h} + \frac{421194.298}{(293.16 + 80Y/h)^2}\right\}. \tag{8}$$
Substituting (8) into the differential equation in (7), expanding the derivative of
the product and simplifying yields

$$\frac{d^2U}{dY^2} = \left[\frac{160}{h}\,\frac{421194.298}{(293.16 + 80Y/h)^3} - \frac{80}{h}\,\frac{839.456}{(293.16 + 80Y/h)^2}\right]\frac{dU}{dY}.$$

Now introduce the nondimensional variables $u = U/U_0$ and $y = Y/h$ to obtain the
boundary value problem

$$u'' = 80\left[\frac{842388.596}{(293.16 + 80y)^3} - \frac{839.456}{(293.16 + 80y)^2}\right]u', \tag{9}$$
$$u(0) = 0, \quad u(1) = 1,$$

where primes denote differentiation with respect to $y$. Finally, taking a uniform
partition of $[0,1]$ with $N = 100$ subintervals, the velocity distribution shown in
Figure 8.6 is obtained. Compare the velocity distribution in this figure with the
distribution displayed in Figure 6.15 (page 517).
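
The calculation can be sketched as follows, assuming NumPy and a dense solve. The coefficient in `p_coeff` is the bracketed factor in the nondimensional problem $u'' = p(y)\,u'$ with $u(0) = 0$, $u(1) = 1$, taken in the form $p(y) = 80\,[842388.596/(293.16 + 80y)^3 - 839.456/(293.16 + 80y)^2]$; the function names are illustrative. Here $q = r = 0$, so the right-hand side vanishes except for the boundary contribution in the last row.

```python
import numpy as np

def p_coeff(y):
    """Coefficient of u' in the nondimensional problem u'' = p(y) u'."""
    T = 293.16 + 80.0 * y
    return 80.0 * (842388.596 / T**3 - 839.456 / T**2)

def velocity_profile(N=100):
    """Finite difference solution with u(0) = 0 and u(1) = 1."""
    h = 1.0 / N
    y = h * np.arange(N + 1)
    pv = p_coeff(y[1:-1])                        # p at the interior grid points
    n = N - 1
    upper = -1.0 + 0.5 * h * pv[:-1]             # u_i for rows i = 1 .. N-2
    lower = -1.0 - 0.5 * h * pv[1:]              # l_i for rows i = 2 .. N-1
    A = 2.0 * np.eye(n) + np.diag(upper, k=1) + np.diag(lower, k=-1)
    rhs = np.zeros(n)
    rhs[-1] = (1.0 - 0.5 * h * pv[-1]) * 1.0     # boundary term from u(1) = 1
    u = np.zeros(N + 1)
    u[-1] = 1.0
    u[1:-1] = np.linalg.solve(A, rhs)
    return y, u

y, u = velocity_profile(100)
print(f"u(0.25) = {u[25]:.6f}, u(0.50) = {u[50]:.6f}, u(0.75) = {u[75]:.6f}")
```

Because $p(y) > 0$ on $[0, 1]$, the computed profile is convex and lies below the linear profile $u = y$ that a constant viscosity would produce.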
EXERCISES


In Exercises 1-5,

(a) identify the interval $[a, b]$ and the functions $p$, $q$, and $r$; and
(b) using a uniform partition of $[a, b]$ with $N = 4$ subintervals, write out the system
of finite difference equations.


y= — oy = GE y(0) = y(1) = 0 


: yl" mss 120y = 2560, y(0) = y(0.15) = 20 


3. $xy'' - (2x + 1)y' + (x + 1)y = 0$, $y(1) = 2e$, $y(3) = 10e^3$


4. $(1 - x\cot x)\,y'' - xy' + y = 0$, $y(1) = 1 + \sin(1)$, $y(2) = 2 + \sin(2)$
- (ay) +y=10, (1) =20, (3) = 100 
6. Suppose we use the finite difference method to approximate the solution of the
boundary value problem $y'' = p(x)y' + q(x)y + r(x)$, $y(a) = \alpha$, $y(b) = \beta$. Choose
any point $x = c$ with $a < c < b$, and select $N$ so that $c$ is one of the interior grid
points. Let $w_m(c)$ denote the approximation to $y(c)$ obtained with a uniform
partition containing $m$ subintervals, and let $j$ be a nonnegative integer. If the
errors associated with the finite difference method are $O(h^k)$, toward what value
should the ratio
$$\frac{w_{2^j N}(c) - w_{2^{j+1} N}(c)}{w_{2^{j+1} N}(c) - w_{2^{j+2} N}(c)}$$

converge as $j$ is increased?


In Exercises 7-14, approximate the solution of the indicated boundary value problem 
using the finite difference method. If the exact solution is given, confirm the second- 
order accuracy of the numerical method using both the maximum absolute error and 
the root mean square error. If the exact solution is not given, use the technique outlined 
in Exercise 6 to confirm the second order accuracy of the numerical method. Explain 
any unusual behavior. 


7. 


11. 


12, 


ul = (et lw +2ut(1-@e®, u(0)=—1, ull) <0, u(2) = (e—1e* 
2d 2 dy wth 2 = _ =k —22 1 1 te 
of (eH) =, uoauoy=o, wey = fe + Hee Ne 
1-1 

re 


-y tay ty=27, y(0)=0, y()=l 
.u+3u'=2? +sinz, u(-5)=10, u(13.2) = 23 


emon (°&) =-l, ul)=0, uf2}=-1/2,  ul(p)= (1 - 0”) 


ty” —(a+5)y'+4y=2, y(l)=-1, y(2)=1 
13.


14, 
15. 


16. 


17. 


18. 
ay! — ay! + 2y = 3274 2Inz, y(1)=9/4, (2) = 13In2,


3 
y(z) = 5 (2 + oa - 2”) + (14 32") lng 


y" —2ey'+2y=-1, y(-1)=0, y(1)=1 


A wooden beam of square cross section is supported at both ends and is carrying
a distributed lateral load of uniform intensity $w = 20$ lb/ft and an axial tension
load $T = 100$ lb. The deflection, $u(x)$, of the beam's centerline satisfies the
boundary value problem


a w 
ui > pple ~ app tll -—z), u(0) =u(L) =0, 


where L = 6 ft is the length, B = 1.3 x 10° Ib fin? is the modulus of elasticity 

and J = s* is the moment of inertia of the beam. The side length of the square 

cross section is s = 4 inches. 

(a) Determine the deflection of the beam at 1-inch intervals along its length. 

(b) Repeat part (a} assuming that the beam tapers along its length so that 
s = (4—2/2L) inches. 


Repeat Exercise 15 for a metal rod of circular cross section. Use the parameter
values
$w = 200$ lb/ft, $T = 750$ lb,
$L = 10$ ft, $E = 3.0 \times 10^7$ lb/in$^2$, and $I = \pi r^4/4$.


For part (a), take $r = 3$ inches. For part (b), use $r = (3 + 0.25\sin(\pi x/L))$ inches.

Rework the "Flow between Parallel Plates" problem, taking

$$\mu(Y) = \exp\left\{-8.944 - \frac{839.456}{373.16 - 80Y/h} + \frac{421194.298}{(373.16 - 80Y/h)^2}\right\}.$$

This models the situation where the lower, stationary plate is maintained at
100°C, the upper, moving plate is maintained at 20°C and there is a linear
temperature gradient between the two plates.


In the “Flow between Parallel Plates” problem, suppose we introduce the effect 
of a constant pressure gradient, which we denote by dp/da, in the direction of 
the flow. The boundary value problem (9) then becomes 


go | 842388.598___—889.456__| i? dp 
w= S| (293.16 + 80y)> — (293.16 + 80y)? Uo dz 
u(0)=0, ul)=1. 
The constant he ge is called the pressure parameter. Calculate the velocity 


distribution for values of the pressure parameter of —4, —2, 0, 2, and 4. 
8.2 FINITE DIFFERENCE METHOD, PART II: THE LINEAR PROBLEM WITH
NON-DIRICHLET BOUNDARY CONDITIONS


In Section 8.1, we considered the prototype one-dimensional two-point boundary
value problem

$$y'' = p(x)y' + q(x)y + r(x), \quad x \in [a, b]$$
subject to the Dirichlet boundary conditions
$$y(a) = \alpha, \quad y(b) = \beta.$$

If our boundary value problem were a model for the one-dimensional steady-state 
conduction of heat in a metal rod, Dirichlet boundary conditions would correspond 
to prescribed temperatures at each end of the rod. 

In practice, however, the temperature at each end of the rod might not be
known. For example, we might only know that the end at $x = b$ was insulated,
so that there was no heat flux from that end. This would give rise to a boundary
condition of the form

$$y'(b) = 0.$$

A boundary condition of this form, in which the value of the derivative is specified, is
known as a Neumann boundary condition. We also might only know that convective
heat transfer is taking place at the end $x = a$. This would give rise to the boundary
condition
$$-ky'(a) = h\,[T_\infty - y(a)],$$

where $k$ is the thermal conductivity of the rod, $h$ is the convective heat transfer
coefficient between the rod and its surroundings, and $T_\infty$ is the ambient temperature
of the surroundings. A boundary condition of this type, in which a linear
combination of the value of the function and the value of the first derivative is
specified, is known as a Robin boundary condition.

In this section we will investigate the formulation of finite difference approx- 
imations for linear boundary value problems subject to both Neumann and Robin 
boundary conditions. 


Non-Dirichlet Boundary Conditions 


Because the general Neumann boundary condition
$$y'(a) = \alpha \quad\text{or}\quad y'(b) = \beta$$
is just a special case of the general Robin boundary condition
$$\alpha_1 y(a) + \alpha_2 y'(a) = \alpha_3 \quad\text{or}\quad \beta_1 y(b) + \beta_2 y'(b) = \beta_3$$

(set $\alpha_1 = 0$ or $\beta_1 = 0$), we will develop the system of algebraic equations for the
finite difference approximation to the linear boundary value problem

$$y'' = p(x)y' + q(x)y + r(x), \quad x \in [a, b]$$
subject to the Robin boundary conditions

$$\alpha_1 y(a) + \alpha_2 y'(a) = \alpha_3$$
$$\beta_1 y(b) + \beta_2 y'(b) = \beta_3.$$

We have already discussed the handling of Dirichlet boundary conditions, so in
what follows, we will assume that $\alpha_2 \ne 0$ and $\beta_2 \ne 0$.

For the computational grid, let $N$ be a positive integer, and partition the
interval $[a,b]$ into

$$a = x_0 < x_1 < x_2 < \cdots < x_{N-1} < x_N = b,$$

where $x_i = a + ih$ and $h = (b - a)/N$. Further, let $w_i$ denote the approximation to
the exact solution, $y(x)$, at $x = x_i$.

We need $N + 1$ equations to determine the values $w_0, w_1, w_2, \ldots, w_N$. $N - 1$
of these equations are obtained as in the previous section: Evaluate the differential
equation at each interior grid point $x = x_i$ $(1 \le i \le N - 1)$, replace the derivatives
by second-order central difference approximations, drop the truncation error terms,
and collect like terms. The resulting computational template is

$$\left(-1 - \frac{h}{2}p_i\right)w_{i-1} + (2 + h^2 q_i)\,w_i + \left(-1 + \frac{h}{2}p_i\right)w_{i+1} = -h^2 r_i.$$

The only remaining question is what we do at $x_0 = a$ and at $x_N = b$.
Let's focus on the treatment of the boundary condition at $x_0 = a$:

$$\alpha_1 y(a) + \alpha_2 y'(a) = \alpha_3.$$


To maintain the second-order accuracy of the other equations, we could replace the
derivative in the boundary condition by the $O(h^2)$ forward difference approximation

$$y'(x_i) \approx \frac{-3y_i + 4y_{i+1} - y_{i+2}}{2h}.$$

Unfortunately, this would destroy the tridiagonal structure of the coefficient
matrix. Using a first-order forward difference formula would maintain the tridiagonal
structure of the coefficient matrix but would degrade the accuracy of the overall
approximation.

A third possibility, which will maintain both the structure of the coefficient
matrix and the second-order accuracy of the approximation, is to introduce a
"fictitious node" to the computational grid.

Applying the computational template for the differential equation at $x = x_0$
produces

$$\left(-1 - \frac{h}{2}p_0\right)w_f + (2 + h^2 q_0)\,w_0 + \left(-1 + \frac{h}{2}p_0\right)w_1 = -h^2 r_0; \tag{1}$$
[Figure: the fictitious node $x_f$ lies a distance $h$ to the left of $x_0 = a$, where the
Neumann/Robin boundary condition is specified; the computational domain begins
at $x_0 = a$ and continues through $x_1, x_2, \ldots$]


of course, $w_f$, the value at the fictitious node, must be eliminated from this equation.
This is accomplished by applying the Robin boundary condition:

$$\alpha_1 y(a) + \alpha_2 y'(a) = \alpha_3 \quad\Rightarrow\quad \alpha_1 w_0 + \alpha_2\,\frac{w_1 - w_f}{2h} = \alpha_3,$$

where we have replaced the first derivative with its second-order central difference
approximation. Solving this last expression for $w_f$ yields

$$w_f = w_1 - \frac{2h}{\alpha_2}\,(\alpha_3 - \alpha_1 w_0).$$

Substituting this relation into equation (1), we obtain the finite difference equation
associated with $x = a$:

$$\left[2 + h^2 q_0 - (2 + hp_0)\,h\,\frac{\alpha_1}{\alpha_2}\right]w_0 - 2w_1 = -h^2 r_0 - (2 + hp_0)\,h\,\frac{\alpha_3}{\alpha_2}.$$

For a Neumann boundary condition, $\alpha_1 = 0$, so the corresponding finite difference
equation would read

$$(2 + h^2 q_0)\,w_0 - 2w_1 = -h^2 r_0 - (2 + hp_0)\,h\alpha,$$

where we have written $\alpha$ for the ratio $\alpha_3/\alpha_2$.
Performing a similar analysis for a Robin boundary condition at $x = b$, we
find the corresponding finite difference equation to be

$$-2w_{N-1} + \left[2 + h^2 q_N + (2 - hp_N)\,h\,\frac{\beta_1}{\beta_2}\right]w_N = -h^2 r_N + (2 - hp_N)\,h\,\frac{\beta_3}{\beta_2}.$$


The derivation of this equation is left as an exercise. For a Neumann boundary
condition at $x = b$, the equation would be

$$-2w_{N-1} + (2 + h^2 q_N)\,w_N = -h^2 r_N + (2 - hp_N)\,h\beta,$$
where we have written $\beta$ for $\beta_3/\beta_2$.

We now have all $N + 1$ finite difference equations. Having started from a linear
differential equation, the resulting algebraic equations are linear in the unknowns.
Let $\mathbf{w} = [\,w_0 \;\; w_1 \;\; w_2 \;\; \cdots \;\; w_N\,]^T$ denote the vector of unknowns, and let
the matrix $A$ and the vector $\mathbf{b}$ have the structure shown in Table 8.1. The finite
difference equations can then be written in the form $A\mathbf{w} = \mathbf{b}$. Dirichlet boundary
conditions have been included in Table 8.1 to present a complete summary of the
second-order finite difference method for linear boundary value problems.


EXAMPLE 8.3 A Problem with One Neumann and One Robin Boundary 
Condition 


Consider the linear ordinary differential equation
$$u'' + u = \sin(3x), \quad x \in [0, \pi/2]$$
subject to a Robin boundary condition at $x = 0$:
$$u(0) + u'(0) = -1$$
and a Neumann boundary condition at $x = \pi/2$:
$$u'(\pi/2) = 1.$$

Let's take a uniform partition of $[0, \pi/2]$ with four subintervals. Then $h = \pi/8$ and
$x_i = i\pi/8$ for $i = 0$, 1, 2, 3, and 4. Comparing the given differential equation with
the prototype, we see that $p(x) = 0$, $q(x) = -1$ and $r(x) = \sin(3x)$. Therefore, for
each $i$,

$$p_i = 0; \qquad q_i = -1; \qquad\text{and}\qquad r_i = \sin(3i\pi/8).$$


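Furthermore, for the Robin boundary condition at $x = 0$, we have $\alpha_1 = \alpha_2 = 1$
and $\alpha_3 = -1$. Finally, with the Neumann boundary condition at $x = \pi/2$, we have
$\beta = 1$.

Using Table 8.1, we then find that the system of finite difference equations is
given by

$$\begin{bmatrix} 2 - \frac{\pi}{4} - \frac{\pi^2}{64} & -2 \\ -1 & 2 - \frac{\pi^2}{64} & -1 \\ & -1 & 2 - \frac{\pi^2}{64} & -1 \\ & & -1 & 2 - \frac{\pi^2}{64} & -1 \\ & & & -2 & 2 - \frac{\pi^2}{64} \end{bmatrix} \begin{bmatrix} w_0 \\ w_1 \\ w_2 \\ w_3 \\ w_4 \end{bmatrix} = \begin{bmatrix} \frac{\pi}{4} \\ -\frac{\pi^2}{64}\sin\frac{3\pi}{8} \\ -\frac{\pi^2}{64}\sin\frac{3\pi}{4} \\ -\frac{\pi^2}{64}\sin\frac{9\pi}{8} \\ \frac{\pi^2}{64} + \frac{\pi}{4} \end{bmatrix}.$$

TABLE 8.1: Matrix Formulation of Second-Order Finite Difference Method for the Linear Boundary
Value Problem $y'' = p(x)y' + q(x)y + r(x)$, $x \in [a,b]$, Subject to Some Combination of Dirichlet
Boundary Conditions ($y(a) = \alpha$, $y(b) = \beta$), Neumann Boundary Conditions ($y'(a) = \alpha$, $y'(b) = \beta$)
and Robin Boundary Conditions ($\alpha_1 y(a) + \alpha_2 y'(a) = \alpha_3$, $\beta_1 y(b) + \beta_2 y'(b) = \beta_3$)

$$A = \begin{bmatrix} a_{1,1} & a_{1,2} \\ l_1 & d_1 & u_1 \\ & l_2 & d_2 & u_2 \\ & & \ddots & \ddots & \ddots \\ & & & l_{N-1} & d_{N-1} & u_{N-1} \\ & & & & a_{N+1,N} & a_{N+1,N+1} \end{bmatrix}, \qquad \mathbf{b} = \begin{bmatrix} b_1 \\ -h^2 r_1 \\ -h^2 r_2 \\ \vdots \\ -h^2 r_{N-1} \\ b_{N+1} \end{bmatrix}$$

where
$$d_i = 2 + h^2 q_i, \qquad u_i = -1 + \frac{h}{2}p_i, \qquad l_i = -1 - \frac{h}{2}p_i$$

$$a_{1,1} = \begin{cases} 1, & \text{Dirichlet BC at } x = a\\ d_0, & \text{Neumann BC at } x = a\\ d_0 + 2hl_0\alpha_1/\alpha_2, & \text{Robin BC at } x = a \end{cases} \qquad a_{1,2} = \begin{cases} 0, & \text{Dirichlet BC at } x = a\\ -2, & \text{otherwise} \end{cases}$$

$$a_{N+1,N+1} = \begin{cases} 1, & \text{Dirichlet BC at } x = b\\ d_N, & \text{Neumann BC at } x = b\\ d_N - 2hu_N\beta_1/\beta_2, & \text{Robin BC at } x = b \end{cases} \qquad a_{N+1,N} = \begin{cases} 0, & \text{Dirichlet BC at } x = b\\ -2, & \text{otherwise} \end{cases}$$

$$b_1 = \begin{cases} \alpha, & \text{Dirichlet BC at } x = a\\ -h^2 r_0 + 2hl_0\alpha, & \text{Neumann BC at } x = a\\ -h^2 r_0 + 2hl_0\alpha_3/\alpha_2, & \text{Robin BC at } x = a \end{cases}$$

$$b_{N+1} = \begin{cases} \beta, & \text{Dirichlet BC at } x = b\\ -h^2 r_N - 2hu_N\beta, & \text{Neumann BC at } x = b\\ -h^2 r_N - 2hu_N\beta_3/\beta_2, & \text{Robin BC at } x = b \end{cases}$$

BC = boundary conditions


In the first and last entries of the right-hand-side vector of the system for this
example, we have used the fact that $\sin 0 = 0$ and $\sin(3\pi/2) = -1$, respectively.
The solution of this tridiagonal linear system is

$$\mathbf{w} = [\,-1.023672 \quad -0.935445 \quad -0.560486 \quad 0.00995175 \quad 0.519840\,]^T.$$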


The following table compares this approximate solution with the exact solution:
$u(x) = -\cos x + (3/8)\sin x - (1/8)\sin 3x$. The accuracy of the approximate
solution is quite reasonable given the crudeness of the computational grid.


             Approximate      Exact
   x_i       Solution, w_i    Solution, u_i    Absolute Error

   0         -1.023672        -1.000000        0.023672
   π/8       -0.935445        -0.895858        0.039587
   π/4       -0.560486        -0.530330        0.030156
   3π/8       0.00995175       0.0116068       0.001655
   π/2        0.519840         0.500000        0.019840

A numerical verification of the second-order accuracy of the scheme is presented in the next table, which displays both the maximum absolute error and the root mean square (rms) error in the approximate solution as a function of the number of subintervals, $N$. The rms error was computed according to the formula

\[ \text{rms error} = \sqrt{\frac{1}{N+1}\sum_{i=0}^{N} (u_i - w_i)^2}. \]

Note that with each doubling of $N$, the step size is cut in half, and the approximation error is reduced by roughly a factor of 4, which is what one would expect from a second-order numerical method.

N     Maximum Absolute Error   Error Ratio   rms Error      Error Ratio
4     0.0395865088                           0.0262038549
8     0.0094846260             4.173755      0.0064356848   4.071650
16    0.0023587346             4.021065      0.0016105638   3.995920
32    0.0005899465             3.998218      0.0004038284   3.988239
64    0.0001473906             4.002605      0.0001011678   3.991667
128   0.0000368515             3.999585      0.0000253223   3.995214
256   0.0000092125             4.000162      0.0000063346   3.997450
512   0.0000023031             4.000031      0.0000015842   3.998676
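The matrix formulation of Table 8.1 can be sketched in a few lines of Python. The function name `solve_linear_bvp` and the tuple encoding of the boundary conditions are illustrative choices, not part of the text, and a dense solver is used only for brevity. The sketch is applied to the example above, written as $u'' = -u + \sin 3x$ on $[0, \pi/2]$ with $u(0) + u'(0) = -1$ and $u'(\pi/2) = 1$.

```python
import numpy as np

def solve_linear_bvp(p, q, r, a, b, left, right, N):
    """Assemble and solve the Table 8.1 system for y'' = p(x) y' + q(x) y + r(x).

    left and right are tuples (c1, c2, c3) encoding c1*y + c2*y' = c3 at the
    endpoint; c2 == 0 means a Dirichlet condition (an illustrative convention).
    """
    h = (b - a) / N
    x = a + h * np.arange(N + 1)
    d = 2.0 + h**2 * q(x)        # main diagonal entries d_i
    u = -1.0 + 0.5 * h * p(x)    # superdiagonal entries u_i
    l = -1.0 - 0.5 * h * p(x)    # subdiagonal entries l_i
    A = np.zeros((N + 1, N + 1))
    rhs = -h**2 * r(x)
    for i in range(1, N):
        A[i, i - 1], A[i, i], A[i, i + 1] = l[i], d[i], u[i]

    a1, a2, a3 = left
    if a2 == 0:                  # Dirichlet at x = a
        A[0, 0], rhs[0] = 1.0, a3 / a1
    else:                        # Neumann (a1 = 0) or Robin at x = a
        A[0, 0] = d[0] + 2 * h * l[0] * a1 / a2
        A[0, 1] = -2.0
        rhs[0] += 2 * h * l[0] * a3 / a2

    b1, b2, b3 = right
    if b2 == 0:                  # Dirichlet at x = b
        A[N, N], rhs[N] = 1.0, b3 / b1
    else:                        # Neumann (b1 = 0) or Robin at x = b
        A[N, N - 1] = -2.0
        A[N, N] = d[N] - 2 * h * u[N] * b1 / b2
        rhs[N] -= 2 * h * u[N] * b3 / b2
    return x, np.linalg.solve(A, rhs)

def exact(x):
    return -np.cos(x) + 0.375 * np.sin(x) - 0.125 * np.sin(3 * x)

def solve_example(N):
    return solve_linear_bvp(lambda x: np.zeros_like(x),
                            lambda x: -np.ones_like(x),
                            lambda x: np.sin(3 * x),
                            0.0, np.pi / 2, (1.0, 1.0, -1.0), (0.0, 1.0, 1.0), N)

x4, w4 = solve_example(4)
errors = []
for N in (4, 8, 16, 32):
    x, w = solve_example(N)
    errors.append(np.max(np.abs(w - exact(x))))
ratios = [errors[k] / errors[k + 1] for k in range(len(errors) - 1)]
print(w4)
print(ratios)
```

For larger $N$, the dense solve would normally be replaced by a tridiagonal solver.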


Handling an Artificial Singularity 


Consider the boundary value problem

\[ u'' + \frac{1}{x} u' = \left(\frac{8}{8 - x^2}\right)^2, \qquad u'(0) = u(1) = 0. \]

Note that the coefficient of the first derivative is singular at the left endpoint of the problem domain, $x = 0$; however, the exact solution to the problem,

\[ u(x) = 2\ln\left(\frac{7}{8 - x^2}\right), \]

is not singular at $x = 0$. We could have anticipated this eventuality since the Neumann boundary condition at $x = 0$ implies that $u'(x)/x$ takes the indeterminate form $0/0$ as $x \to 0$. Using L'Hôpital's Rule,

\[ \lim_{x \to 0} \frac{u'(x)}{x} = \lim_{x \to 0} \frac{u''(x)}{1} = u''(0), \qquad (2) \]

so the first derivative term in the differential equation is not singular. Such a situation is referred to as an artificial singularity. Artificial singularities occur frequently in problems in polar, cylindrical, and spherical coordinates.

How do we handle the artificial singularity in this problem, within the context of constructing a finite difference approximation? Once we establish a computational grid, at every grid point but $x_0 = 0$ we can employ our standard finite difference procedures (summarized in Table 8.1) to generate $N$ of the $N + 1$ finite difference equations. At $x = 0$, we cannot use Table 8.1, since $p(0)$ is undefined. However, making use of equation (2), we find that, in the limit as $x \to 0$, the differential equation reduces to

\[ u''(0) + u''(0) = \left(\frac{8}{8 - 0^2}\right)^2 = 1, \]

or $u''(0) = 1/2$. If we replace the second derivative in this expression with its second-order finite difference formula (which will involve the use of a fictitious node), drop the truncation error term, and use the Neumann boundary condition to eliminate the fictitious node, the finite difference equation for $x = 0$ is found to be

\[ 2w_0 - 2w_1 = -\frac{1}{2}h^2. \]

The details of this derivation are left as an exercise.

It is important to note that we can arrive at the same system of equations, including the correct equation corresponding to $x = 0$, if we consolidate the equations

\[ u''(0) = \frac{1}{2} \qquad \text{and} \qquad u'' + \frac{1}{x}u' = \left(\frac{8}{8 - x^2}\right)^2 \]

into the single differential equation $u'' = p(x)u' + r(x)$, where

\[ p(x) = \begin{cases} 0, & x = 0 \\ -\dfrac{1}{x}, & x \ne 0 \end{cases}
\qquad \text{and} \qquad
r(x) = \begin{cases} \dfrac{1}{2}, & x = 0 \\ \left(\dfrac{8}{8 - x^2}\right)^2, & x \ne 0. \end{cases} \]

This procedure will allow us to handle problems with artificial singularities using the material contained in Table 8.1. Proceeding in this manner for the current problem, we obtain the results tabulated below, which demonstrate that the scheme maintains full second-order accuracy.


N     Maximum Absolute Error   Error Ratio   rms Error      Error Ratio
4     0.0016976480                           0.0012738555
8     0.0004369617             3.885119      0.0003270583   3.894889
16    0.0001101553             3.966778      0.0000824331   3.967562
32    0.0000276039             3.990575      0.0000206671   3.988619
64    0.0000069055             3.997364      0.0000051726   3.995466
128   0.0000017267             3.999271      0.0000012938   3.998011
256   0.0000004317             3.999800      0.0000003235   3.999073
512   0.0000001079             3.999946      0.0000000809   3.999554
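A minimal sketch of this procedure follows. It assumes the problem $u'' = p(x)u' + r(x)$ on $[0,1]$ with $u'(0) = u(1) = 0$, sets $p(0) = 0$ and $r(0) = 1/2$ by hand, and compares with $u(x) = 2\ln\left(7/(8 - x^2)\right)$. The function name is illustrative and a dense solver stands in for a tridiagonal one.

```python
import numpy as np

def solve_singular_bvp(N):
    """Finite differences for u'' = p(x) u' + r(x) on [0, 1], u'(0) = u(1) = 0,
    with the coefficient values at x = 0 replaced by their limiting values."""
    h = 1.0 / N
    x = np.linspace(0.0, 1.0, N + 1)
    p = np.zeros(N + 1)
    p[1:] = -1.0 / x[1:]                 # p(x) = -1/x away from the origin
    r = (8.0 / (8.0 - x**2))**2
    r[0] = 0.5                           # limiting value, from u''(0) = 1/2
    A = np.zeros((N + 1, N + 1))
    rhs = -h**2 * r
    A[0, 0], A[0, 1] = 2.0, -2.0         # Neumann row: 2 w_0 - 2 w_1 = -h^2 / 2
    for i in range(1, N):
        A[i, i - 1] = -1.0 - 0.5 * h * p[i]
        A[i, i] = 2.0                    # q(x) = 0 in this problem
        A[i, i + 1] = -1.0 + 0.5 * h * p[i]
    A[N, N], rhs[N] = 1.0, 0.0           # Dirichlet row: w_N = 0
    return x, np.linalg.solve(A, rhs)

def max_error(N):
    x, w = solve_singular_bvp(N)
    return np.max(np.abs(w - 2.0 * np.log(7.0 / (8.0 - x**2))))

e4, e8, e16 = max_error(4), max_error(8), max_error(16)
print(e4, e4 / e8, e8 / e16)
```

The printed ratios should approach 4 as the grid is refined, mirroring the table above.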


Application Problem 1: Steady-State Temperature Distribution in a Pin Fin


In the Overview to this chapter (see page 656), we developed a model for the steady-state temperature distribution, $T(x)$, along the length of a pin fin with nonuniform cross section. Recall that the boundary value problem satisfied by $T(x)$ is

\[ \frac{d}{dx}\left([r(x)]^2 \frac{dT}{dx}\right) - \frac{2h}{k} r(x)(T - T_\infty) = 0 \]

\[ T(0) = T_0, \qquad -kT'(L) = h(T(L) - T_\infty), \]

where $r(x)$ is the radius of the fin cross section, $h$ is the convection heat transfer coefficient, $k$ is the thermal conductivity of the fin, $T_\infty$ is the ambient temperature of the fluid surrounding the fin, $T_0$ is the temperature of the solid from which the fin extends, and $L$ is the length of the fin. Once the temperature distribution has been determined, the total fin heat transfer rate is given by

\[ q_f = -k\pi [r(0)]^2 T'(0). \]

Suppose a fin made from AISI stainless steel, with a thermal conductivity of $k = 14$ W/m·K, extends from a solid whose temperature is $T_0 = 100^\circ$C. The pin has a length of $L = 10$ cm and tapers linearly along its length from a radius of 2 cm where the pin attaches to the solid to a radius of 1 cm at its tip; that is,

\[ r(x) = \left(2 - \frac{x}{10}\right) \text{ cm}. \]

The temperature of the surrounding fluid (air in this case) is $T_\infty = 20^\circ$C, and the convection heat transfer coefficient is $h = 20$ W/m²·K.
Substituting the indicated parameter values into the model boundary value problem, converting all distance units to meters and rearranging into standard form yields

\[ T'' = \frac{2}{20 - x} T' + \frac{20000}{140 - 7x} T - \frac{400000}{140 - 7x} \]

\[ T(0) = 100, \qquad 20T(0.1) + 14T'(0.1) = 400. \]

Figure 8.7 Temperature distribution along length of a pin fin with non-uniform cross section and experiencing convective heat loss from its lateral surface. (Horizontal axis: distance along axis of pin fin, in meters; vertical axis: temperature, in °C.)


Taking a uniform partition of the interval $[0, 0.1]$ with $N = 100$ subintervals produces the temperature distribution shown in Figure 8.7. The temperature at the tip of the fin ($x = 0.1$) is approximately 60.36°C. To determine the total fin heat transfer rate, we need to know $T'(0)$. Using the formula for the second-order forward difference approximation to the first derivative, we calculate

\[ T'(0) \approx \frac{-3T(0) + 4T(0.001) - T(0.002)}{0.002} = -825.69\ \frac{\text{K}}{\text{m}}. \]

Accordingly,

\[ q_f = -14\pi(0.02)^2(-825.69) = 14.53 \text{ W}. \]
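The calculation can be sketched as follows, taking the coefficients of the standard-form equation exactly as printed above and the Robin row from Table 8.1 with $\beta_1 = 20$, $\beta_2 = 14$, $\beta_3 = 400$. Variable names are illustrative and a dense solver is used for brevity.

```python
import numpy as np

N = 100
h = 0.1 / N
x = h * np.arange(N + 1)

# Coefficients of T'' = p(x) T' + q(x) T + r(x), as printed in the text
p = 2.0 / (20.0 - x)
q = 20000.0 / (140.0 - 7.0 * x)
r = -400000.0 / (140.0 - 7.0 * x)

d = 2.0 + h**2 * q
u = -1.0 + 0.5 * h * p
l = -1.0 - 0.5 * h * p

A = np.zeros((N + 1, N + 1))
rhs = -h**2 * r
A[0, 0], rhs[0] = 1.0, 100.0                  # Dirichlet: T(0) = 100
for i in range(1, N):
    A[i, i - 1], A[i, i], A[i, i + 1] = l[i], d[i], u[i]
b1, b2, b3 = 20.0, 14.0, 400.0                # Robin: 20 T + 14 T' = 400 at x = 0.1
A[N, N - 1] = -2.0
A[N, N] = d[N] - 2 * h * u[N] * b1 / b2
rhs[N] -= 2 * h * u[N] * b3 / b2

T = np.linalg.solve(A, rhs)

# Second-order forward difference for T'(0), then the fin heat transfer rate
dT0 = (-3.0 * T[0] + 4.0 * T[1] - T[2]) / (2.0 * h)
q_f = -14.0 * np.pi * 0.02**2 * dT0
print(T[-1], dT0, q_f)
```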


Application Problem 2: The Heat Pack 


A heat pack is in the shape of a thin circular cylinder with radius $R$ and thickness $t$, as shown in Figure 8.8(a). When the pack is squeezed, a bubble inside the pack breaks, releasing chemicals that initiate an exothermic reaction. We will suppose that the generation of heat resulting from this reaction is uniformly distributed throughout the pack. Since the pack is thin, we will further assume that temperature variations across the thickness are negligible.

Figure 8.8 (a) Heat pack in the shape of a thin circular cylinder. (b) Energy balance on an arbitrary control volume of the heat pack.

To model the temperature variation in the radial direction, we perform an energy balance on the control volume shown in Figure 8.8(b). This gives

\[ -\frac{dq_r}{dr}\Delta r - 2q_{conv} + \dot{q}\, r\,\Delta r\,\Delta\theta\, t = 0, \qquad (3) \]

where $q_r$ is the rate of heat transfer due to conduction, $q_{conv}$ is the rate of heat transfer due to convection, $\dot{q}$ is the rate of heat generation per unit volume, and $\Delta\theta$ is the angle subtended by the control volume. In this geometry, Fourier's law gives

\[ q_r = -k\, r\,\Delta\theta\, t\, \frac{dT}{dr}, \qquad (4) \]


where $k$ is the thermal conductivity of the material inside the heat pack. The convection heat transfer rate is

\[ q_{conv} = h\, r\,\Delta r\,\Delta\theta\,(T - T_\infty), \qquad (5) \]

where $h$ is the convection heat transfer coefficient and $T_\infty$ is the temperature of the air surrounding the heat pack.

Substituting (4) and (5) into (3) and dividing by $r\,\Delta r\,\Delta\theta\, t$ yields

\[ \frac{k}{r}\frac{d}{dr}\left(r\frac{dT}{dr}\right) - \frac{2h}{t}(T - T_\infty) + \dot{q} = 0. \qquad (6) \]

Symmetry about $r = 0$ gives rise to the boundary condition $T'(0) = 0$, while convective heat loss at the outer edge of the cylinder leads to the boundary condition

\[ -kT'(R) = h(T(R) - T_\infty). \]

Observe that equation (6) has an artificial singularity at $r = 0$. Taking the limit of (6) as $r$ approaches zero and using $T'(0) = 0$, we find that, at $r = 0$, the governing differential equation reduces to

\[ T''(0) - \frac{h}{kt}T(0) + \left(\frac{hT_\infty}{kt} + \frac{\dot{q}}{2k}\right) = 0. \qquad (7) \]

Thus, we can identify the coefficient functions, for use with Table 8.1, as

\[ p(r) = \begin{cases} 0, & r = 0 \\ -1/r, & r \ne 0, \end{cases}
\qquad
q(r) = \begin{cases} h/kt, & r = 0 \\ 2h/kt, & r \ne 0, \end{cases}
\qquad \text{and} \qquad
f(r) = \begin{cases} -(hT_\infty/kt + \dot{q}/2k), & r = 0 \\ -(2hT_\infty/kt + \dot{q}/k), & r \ne 0. \end{cases} \]

Note that we have denoted the nonhomogeneous term by $f$, rather than $r$, because $r$ is already serving duty as the independent variable in this problem.

Suppose the heat pack has a radius of $R = 10$ cm and thickness of $t = 0.6$ cm. The convection heat transfer coefficient is $h = 20$ W/m²·K, and the temperature of the air surrounding the heat pack is $T_\infty = 20^\circ$C. Take $k = 0.4$ W/m·K as the thermal conductivity of the material inside the pack. If the chemical reaction releases 30 W of heat energy, then

\[ \dot{q} = \frac{30}{\pi(0.1)^2(0.006)} = \frac{500000}{\pi}\ \frac{\text{W}}{\text{m}^3}. \]

With these parameter values and a uniform partition of $[0, 0.1]$ with $N = 100$ subintervals, the temperature profile shown in Figure 8.9 is obtained.


EXERCISES 


1. Derive the finite difference equation

\[ -2w_{N-1} + \left[2 + h^2 q_N + (2 - hp_N)h\frac{\beta_1}{\beta_2}\right] w_N = -h^2 r_N + (2 - hp_N)h\frac{\beta_3}{\beta_2} \]

corresponding to the Robin boundary condition $\beta_1 y(b) + \beta_2 y'(b) = \beta_3$, where $x = b$ is the right endpoint of the problem domain.

2. Derive the finite difference equation

\[ 2w_0 - 2w_1 = -\frac{1}{2}h^2 \]

corresponding to $x = 0$ for the boundary value problem

\[ u'' + \frac{1}{x}u' = \left(\frac{8}{8 - x^2}\right)^2, \qquad u'(0) = u(1) = 0. \]

Recall that at $x = 0$, the differential equation reduces to $u''(0) = 1/2$.

Figure 8.9 Radial temperature distribution in heat pack.


In Exercises 3-10,

(a) identify the interval $[a,b]$ and the functions $p$, $q$, and $r$; and
(b) using a uniform partition of $[a,b]$ with $N = 4$ subintervals, write out the system of finite difference equations.

Note that some of these problems have artificial singularities.


" 3 2 ! 
i = SS -— i => 
3. y a4a! ~ @ta y(0) =y (1) =0 

4. * (wy’)' =1, y'(0)=0, y(1) =10 


5. y” = 120y — 2560, 10y(0) + 35y’(0) = 200, (0.15) = 20 


f 
: 2 ( *y’) =l+2", y/(0)=0, 4y(1)+y'(1) = 40 
7. $xy'' - (2x + 1)y' + (x + 1)y = 0$, $y(1) - 2y'(1) = -6e$, $y(3) + y'(3) = 26e^3$
8. ‘ (zy')’ +y=10, y(t) = 20, y(3) +y/(3) =10 


1 
9. Jae aa oe y (Or=O, gia 


10. $(1 - x\cot x)y'' - xy' + y = 0$, $y'(1) = 1 + \cos(1)$, $y(2) + y'(2) = 3 + \sin(2) + \cos(2)$


In Exercises 11-16, approximate the solution of the indicated boundary value problem using the finite difference method and confirm the second-order accuracy of the numerical method. Explain any unusual behavior.


11. 


12, 


13. 


14, 
15. 


16, 
17. 


18. 


v-y=1, yO=0, v+¥M=1, ve)=e 4 (1-s)e*-1 


1 
= (0 ul) =-1, v()=0, uQ)=1, ul) = 507-0") 
$xy'' - (2x + 1)y' + (x + 1)y = 0$, $y(1) - 2y'(1) = -6e$, $y(3) + y'(3) = 26e^3$, $y(x) = (1 + x^2)e^x$
y" — Qey’ + 2y=—-1, y(-1)=0, y(l)=1 
y +ayt+y=2", y(0)+y(0)=0, y(l)=1 
1 
y+oyt+y=1, y()=0, y()=1 
One component of a model for a styrene monomer tubular reactor is the steady- 


state temperature profile of the solid phase catalyst. The governing boundary 
value problem is 


am aae 
adr 

7! #0, 7(l)=1 
az | 9 


Here, t = (T.—T)/(Tw —T) is the nondimensional catalyst temperature, x is the 
nondimensional radial position, T is the constant temperature of the fluid in the 
reactor, and Ty, is the temperature at the wall of the reactor. The parameter eB? 
is given by 

R°RA 
kl —.)’ 


where R = 1.3 cm is the radius of the reactor, h = 107° calfem?-s-°C is 
the heat transfer coefficient, ¢ = 0.36 is the porosity of the packed bed reactor, 
A=15 cm"! is the surface area of the catalyst per unit volume, and k = 0.0034 
cal/cm-s-° C is the thermal diffusivity of the catalyst. Approximate r(x) using 
Az = 0.0025. 

A thin cylindrical fiber, ten inches in length, has its left end maintained at a 
constant temperature Tp and experiences convective heat loss along its lateral 
surface and from its right end. The temperature within the fiber is governed by 
the boundary value problem 


eee G f) 4+2hrT = 2hrTo, 


B= 


eh. 2 arGy = Te 
az | os 
The parameters in this problem are the thermal conductivity of use fiber k = 
2 ae /sec-in-° F, the convective heat transfer coefficient h = 10° BTU/sec - 
n®.°F, the radius of the fiber 


T(-5) = To, 


0.1 
=0. ——; | inch 
r = 0.002 (1+ sp, inches, 
the ambient temperature of the surroundings $T_\infty = 50^\circ$F and the constant temperature maintained at the left end of the fiber $T_0 = 200^\circ$F. Determine the temperature within the fiber at increments of 0.1 inches.

19. Rework the first application problem from this section, "Steady-State Temperature Distribution in a Pin Fin," for an iron fin with thermal conductivity $k = 80$ W/m·K. Use the values given in the text for all other parameters.

20. Rework the first application problem from this section, "Steady-State Temperature Distribution in a Pin Fin," for a copper fin with thermal conductivity $k = 401$ W/m·K. Take

\[ r(x) = 2 - 2\left(\frac{x}{10}\right) + \left(\frac{x}{10}\right)^2 \text{ cm}, \]

and use the values given in the text for all other parameters.

21. Reconsider the second application problem from this section, "The Heat Pack." Calculate the radial temperature profile for convection heat transfer coefficients of $h = 5$, 10, 15, and 25 W/m²·K. Use the values given in the text for all other parameters. What effect does changing the value of $h$ seem to have on the resulting temperature profile? Examine the temperature at the center of the pack, the length of the nearly constant portion of the profile, and the temperature at the outer edge of the pack.

22. Suppose we had chosen to handle Neumann and Robin boundary conditions by replacing the derivative that appears in the conditions with a first-order finite difference formula. This would mean, for example, that the condition $y'(a) = \alpha$ would translate into the finite difference equation $w_0 - w_1 = -h\alpha$, assuming that $x = a$ were the left endpoint of the domain. As stated earlier, following this approach maintains the tridiagonal structure of the coefficient matrix but introduces $O(h)$ errors where all other errors were $O(h^2)$. In this exercise we will investigate the effect of these lower-order errors on the overall accuracy of the approximation.
Consider the boundary value problem

\[ u'' + u = \sin 3x, \qquad x \in [0, \pi/2] \]
\[ u(0) + u'(0) = -1, \qquad u'(\pi/2) = 1, \]

whose exact solution is $u(x) = -\cos x + (3/8)\sin x - (1/8)\sin 3x$. Use Table 8.1 to construct the system of finite difference equations for this BVP, but replace the first equation by $(1 - h)w_0 - w_1 = h$ and replace the last equation by $w_N - w_{N-1} = h$. By computing the approximate solution for various values of $N$ and comparing with the exact solution, numerically estimate the order of convergence of this modified finite difference method.

FINITE DIFFERENCE METHOD, PART III: NONLINEAR PROBLEMS


Having treated linear boundary value problems in detail, we now turn our attention to the nonlinear boundary value problem

\[ y'' = f(x, y, y'), \qquad x \in [a,b] \]
\[ \alpha_1 y(a) + \alpha_2 y'(a) = \alpha_3 \]
\[ \beta_1 y(b) + \beta_2 y'(b) = \beta_3. \]

For completeness, we note that when

• $f$, $\partial f/\partial y$ and $\partial f/\partial y'$ are continuous on the set
  \[ D = \{(x, y, y') : a \le x \le b,\ y, y' \in \mathbb{R}\}; \]
• $\partial f/\partial y > 0$ for all $(x, y, y') \in D$; and
• there exists a constant $L$ such that $|\partial f/\partial y'| \le L$ for all $(x, y, y') \in D$,

the given boundary value problem is guaranteed to have a unique solution. For a proof of this result, see Keller [1].

The techniques that were introduced in the previous two sections for developing finite difference equations can still be applied to a nonlinear differential equation; however, the resulting system of algebraic equations will be nonlinear. In this section, we will derive the second-order finite difference equations associated with the general nonlinear boundary value problem cited above. The solution of the nonlinear algebraic equations will also be discussed.


Dirichlet Boundary Conditions 


Let's begin our investigation of finite difference methods for nonlinear boundary value problems by considering the case of Dirichlet boundary conditions:

\[ y(a) = \alpha, \qquad y(b) = \beta. \]

First, we introduce our standard computational grid, which is defined by

\[ x_i = a + ih \quad (i = 0, 1, 2, \ldots, N), \qquad h = (b - a)/N. \]

[Diagram: computational grid $x_0 = a, x_1, \ldots, x_{N-1}, x_N = b$, with a Dirichlet boundary condition at each endpoint; the approximate solution is unknown at all interior grid points.]

Next, we evaluate the governing differential equation at an arbitrary interior grid point $x = x_i$,

\[ y''(x_i) = f(x_i, y(x_i), y'(x_i)), \]

and substitute the second-order finite difference formulas

\[ y''(x_i) = \frac{y_{i-1} - 2y_i + y_{i+1}}{h^2} + O(h^2) \]
\[ y'(x_i) = \frac{y_{i+1} - y_{i-1}}{2h} + O(h^2) \]

for the derivatives. The truncation error terms are then dropped, and the $y$'s (exact solution) are replaced by $w$'s (approximate solution). The resulting equation is

\[ \frac{w_{i-1} - 2w_i + w_{i+1}}{h^2} = f\left(x_i, w_i, \frac{w_{i+1} - w_{i-1}}{2h}\right). \qquad (1) \]

With the assumed bound on $\partial f/\partial y'$, the overall truncation error for this approximation is $O(h^2)$. Rearranging equation (1) produces the computational template

\[ -w_{i-1} + 2w_i - w_{i+1} + h^2 f\left(x_i, w_i, \frac{w_{i+1} - w_{i-1}}{2h}\right) = 0, \qquad (2) \]

which holds for $i = 1, 2, 3, \ldots, N - 1$. To these equations, we add

\[ w_0 = \alpha \qquad \text{and} \qquad w_N = \beta, \]

both of which are obtained from the boundary conditions.
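If we let

\[ \mathbf{w} = \begin{bmatrix} w_0 & w_1 & w_2 & \cdots & w_N \end{bmatrix}^T \]

and

\[ \mathbf{G}(\mathbf{w}) = \begin{bmatrix} g_0(\mathbf{w}) & g_1(\mathbf{w}) & g_2(\mathbf{w}) & \cdots & g_N(\mathbf{w}) \end{bmatrix}^T, \]

where

\[ g_0(\mathbf{w}) = w_0 - \alpha, \]
\[ g_i(\mathbf{w}) = -w_{i-1} + 2w_i - w_{i+1} + h^2 f\left(x_i, w_i, \frac{w_{i+1} - w_{i-1}}{2h}\right) \quad (i = 1, 2, 3, \ldots, N - 1), \]
\[ g_N(\mathbf{w}) = w_N - \beta, \]

then the system of finite difference equations can be written in vector form as

\[ \mathbf{G}(\mathbf{w}) = \mathbf{0}. \]

This is a nonlinear system of equations that we will solve with Newton's method.

Recall from Section 3.10 that Newton's method for the general system of $m$ equations, $\mathbf{F}(\mathbf{x}) = \mathbf{0}$, generates the sequence $\{\mathbf{x}^{(n)}\}$ according to the rule

\[ \mathbf{x}^{(n+1)} = \mathbf{x}^{(n)} + \mathbf{v}^{(n)}. \]

The update vector, $\mathbf{v}^{(n)}$, is the solution of the linear system

\[ \left[J(\mathbf{x}^{(n)})\right]\mathbf{v}^{(n)} = -\mathbf{F}(\mathbf{x}^{(n)}), \]

where $J$ is the Jacobian matrix for the system. The Jacobian is given by

\[ J(\mathbf{x}) = \begin{bmatrix}
\partial f_1/\partial x_1 & \partial f_1/\partial x_2 & \partial f_1/\partial x_3 & \cdots & \partial f_1/\partial x_m \\
\partial f_2/\partial x_1 & \partial f_2/\partial x_2 & \partial f_2/\partial x_3 & \cdots & \partial f_2/\partial x_m \\
\partial f_3/\partial x_1 & \partial f_3/\partial x_2 & \partial f_3/\partial x_3 & \cdots & \partial f_3/\partial x_m \\
\vdots & \vdots & \vdots & & \vdots \\
\partial f_m/\partial x_1 & \partial f_m/\partial x_2 & \partial f_m/\partial x_3 & \cdots & \partial f_m/\partial x_m
\end{bmatrix}. \]

Since each iteration requires the solution of a linear system of equations, if the Jacobian matrix for the nonlinear equations is a full matrix, Newton's method will be very expensive.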

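Fortunately, the Jacobian matrix for the system of finite difference equations $\mathbf{G}(\mathbf{w}) = \mathbf{0}$ has a special structure. With the exception of the first and last equations, the finite difference equations take the form

\[ g_i(\mathbf{w}) = -w_{i-1} + 2w_i - w_{i+1} + h^2 f\left(x_i, w_i, \frac{w_{i+1} - w_{i-1}}{2h}\right). \]

Since $g_i(\mathbf{w})$ depends only on the unknowns $w_{i-1}$, $w_i$ and $w_{i+1}$, the only nonzero entries along the $i$th row of the Jacobian will be

\[ \frac{\partial g_i}{\partial w_{i-1}} = -1 - \frac{h}{2}\frac{\partial f}{\partial y'}\left(x_i, w_i, \frac{w_{i+1} - w_{i-1}}{2h}\right), \]
\[ \frac{\partial g_i}{\partial w_i} = 2 + h^2\frac{\partial f}{\partial y}\left(x_i, w_i, \frac{w_{i+1} - w_{i-1}}{2h}\right), \quad \text{and} \]
\[ \frac{\partial g_i}{\partial w_{i+1}} = -1 + \frac{h}{2}\frac{\partial f}{\partial y'}\left(x_i, w_i, \frac{w_{i+1} - w_{i-1}}{2h}\right). \]

The first and last rows of the Jacobian will contain a 1 along the main diagonal and zeros everywhere else. Therefore, the overall structure of the Jacobian will be

\[ J = \begin{bmatrix}
1 & 0 & & & & \\
l_1 & d_1 & u_1 & & & \\
 & l_2 & d_2 & u_2 & & \\
 & & \ddots & \ddots & \ddots & \\
 & & & l_{N-1} & d_{N-1} & u_{N-1} \\
 & & & & 0 & 1
\end{bmatrix}, \]

where

\[ d_i = 2 + h^2\frac{\partial f}{\partial y}\left(x_i, w_i, \frac{w_{i+1} - w_{i-1}}{2h}\right), \]
\[ u_i = -1 + \frac{h}{2}\frac{\partial f}{\partial y'}\left(x_i, w_i, \frac{w_{i+1} - w_{i-1}}{2h}\right), \]
\[ l_i = -1 - \frac{h}{2}\frac{\partial f}{\partial y'}\left(x_i, w_i, \frac{w_{i+1} - w_{i-1}}{2h}\right). \]

Note that this is a tridiagonal matrix. The number of operations needed to solve a linear system with a tridiagonal coefficient matrix is only $O(n)$, a significant savings over the general case.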
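To make the $O(n)$ operation count concrete, here is a sketch of a tridiagonal solver (the Thomas algorithm, i.e., Gaussian elimination without pivoting specialized to three diagonals). The function name and argument layout are illustrative; the absence of pivoting is an assumption that is safe for diagonally dominant matrices but not in general.

```python
import numpy as np

def solve_tridiagonal(lower, diag, upper, rhs):
    """Solve a tridiagonal system with one forward sweep and one back substitution.

    lower[i] multiplies unknown i-1 in row i (lower[0] is ignored);
    upper[i] multiplies unknown i+1 in row i (upper[-1] is ignored).
    No pivoting is performed.
    """
    n = len(diag)
    c = np.zeros(n)   # modified superdiagonal
    g = np.zeros(n)   # modified right-hand side
    c[0] = upper[0] / diag[0]
    g[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        if i < n - 1:
            c[i] = upper[i] / denom
        g[i] = (rhs[i] - lower[i] * g[i - 1]) / denom
    x = np.zeros(n)
    x[-1] = g[-1]
    for i in range(n - 2, -1, -1):
        x[i] = g[i] - c[i] * x[i + 1]
    return x

# Small check against a dense solve
n = 6
lower = -np.ones(n)
diag = 4.0 * np.ones(n)
upper = -np.ones(n)
rhs = np.arange(1.0, n + 1.0)
dense = np.diag(diag) + np.diag(lower[1:], -1) + np.diag(upper[:-1], 1)
x_thomas = solve_tridiagonal(lower, diag, upper, rhs)
x_dense = np.linalg.solve(dense, rhs)
print(np.max(np.abs(x_thomas - x_dense)))
```

Each loop touches every row once, so the work grows linearly with the number of unknowns.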

The final issue to discuss regarding the use of Newton's method to solve the finite difference equations is the choice of the initial vector $\mathbf{w}^{(0)}$. Unless a previous approximate solution is available, whenever Dirichlet boundary conditions have been specified, the simplest scheme for obtaining an initial vector is to pass a line through the points $(a, \alpha)$ and $(b, \beta)$ and evaluate that function at each $x = x_i$. This procedure yields

\[ w_i^{(0)} = \frac{\beta - \alpha}{b - a}(x_i - a) + \alpha, \]

or, after substituting $x_i = a + ih$ and $h = (b - a)/N$,

\[ w_i^{(0)} = \alpha + \frac{i}{N}(\beta - \alpha). \]


EXAMPLE 8.4 A Sample Nonlinear Boundary Value Problem


Let's use the finite difference method to approximate the solution of the nonlinear boundary value problem with Dirichlet boundary conditions

\[ yy'' + (y')^2 + 1 = 0 \]
\[ y(1) = 1, \qquad y(2) = 2. \]

Solving the differential equation for the second derivative, we find

\[ y'' = -\frac{1 + (y')^2}{y}, \]

so that

\[ f(x, y, y') = -\frac{1 + (y')^2}{y}. \]

Applying the basic rules of differential calculus, we compute the partial derivatives

\[ \frac{\partial f}{\partial y}(x, y, y') = \frac{1 + (y')^2}{y^2} \qquad \text{and} \qquad \frac{\partial f}{\partial y'}(x, y, y') = -\frac{2y'}{y}. \]

Thus, the elements along the diagonal of the Jacobian are given by

\[ d_i = 2 + h^2\,\frac{1 + \left(\dfrac{w_{i+1} - w_{i-1}}{2h}\right)^2}{w_i^2} = 2 + \frac{4h^2 + (w_{i+1} - w_{i-1})^2}{4w_i^2}, \]

while the off-diagonal elements are given by

\[ u_i = -1 + \frac{h}{2}\left(-\frac{2}{w_i}\right)\frac{w_{i+1} - w_{i-1}}{2h} = -1 - \frac{w_{i+1} - w_{i-1}}{2w_i} \]

and

\[ l_i = -1 - \frac{h}{2}\left(-\frac{2}{w_i}\right)\frac{w_{i+1} - w_{i-1}}{2h} = -1 + \frac{w_{i+1} - w_{i-1}}{2w_i}. \]

Using a uniform grid with $N = 8$ subintervals, we obtain the results listed in the second column of the following table. A convergence tolerance of $TOL = 5 \times 10^{-14}$ was used to terminate the Newton iterations. The values in the third column were obtained by evaluating the exact solution to the boundary value problem, $y(x) = \sqrt{6x - 4 - x^2}$, at each grid point.


x_i     Approximate Solution, w_i   Exact Solution, y_i   Absolute Error
1.000   1.000000                    1.000000              0.000000
1.125   1.217747                    1.218349              0.000602
1.250   1.391239                    1.391941              0.000702
1.375   1.535371                    1.536026              0.000655
1.500   1.657760                    1.658312              0.000552
1.625   1.762916                    1.763342              0.000426
1.750   1.853761                    1.854050              0.000289
1.875   1.932307                    1.932453              0.000146
2.000   2.000000                    2.000000              0.000000

We complete this example by demonstrating the second-order accuracy of the numerical method. The next table lists the maximum absolute error and the root mean square (rms) error in the approximate solution as a function of the number of subintervals in the computational grid. With each doubling of the number of subintervals, the step size is cut in half. For a second-order method we would expect the error to decrease by a factor of $2^2 = 4$. The error ratio values listed in the third and fifth columns are clearly approaching the expected value.


N     Maximum Absolute Error   Error Ratio   rms Error      Error Ratio
4     0.0025761053                           0.0015533493
8     0.0007021651             3.668803      0.0004564927   3.402791
16    0.0001804347             3.891520      0.0001209483   3.774280
32    0.0000454487             3.970073      0.0000309207   3.911566
64    0.0000113841             3.992300      0.0000078033   3.962232
128   0.0000028474             3.998060      0.0000019594   3.982742
256   0.0000007120             3.999337      0.0000004909   3.991779
512   0.0000001780             3.999875      0.0000001228   3.995992
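The Newton iteration for the Dirichlet case can be sketched as follows and applied to Example 8.4. The function name, the stopping test (largest component of the update below the tolerance), and the iteration cap are assumptions, since the text does not spell them out; a dense solve is used for brevity where a tridiagonal solver would normally be preferred.

```python
import numpy as np

def newton_fd_dirichlet(f, f_y, f_yp, a, b, alpha, beta, N, tol=5e-14, max_iter=50):
    """Second-order finite differences plus Newton's method for
    y'' = f(x, y, y'), y(a) = alpha, y(b) = beta."""
    h = (b - a) / N
    x = a + h * np.arange(N + 1)
    w = alpha + (beta - alpha) * (x - a) / (b - a)   # straight-line initial vector
    idx = np.arange(1, N)
    for _ in range(max_iter):
        s = (w[2:] - w[:-2]) / (2 * h)               # centered slopes at interior nodes
        xi, wi = x[1:-1], w[1:-1]
        G = np.zeros(N + 1)
        G[0] = w[0] - alpha
        G[N] = w[N] - beta
        G[1:-1] = -w[:-2] + 2 * wi - w[2:] + h**2 * f(xi, wi, s)
        J = np.zeros((N + 1, N + 1))
        J[0, 0] = J[N, N] = 1.0
        J[idx, idx - 1] = -1.0 - 0.5 * h * f_yp(xi, wi, s)   # l_i
        J[idx, idx] = 2.0 + h**2 * f_y(xi, wi, s)            # d_i
        J[idx, idx + 1] = -1.0 + 0.5 * h * f_yp(xi, wi, s)   # u_i
        v = np.linalg.solve(J, -G)                   # Newton update
        w = w + v
        if np.max(np.abs(v)) < tol:
            break
    return x, w

# Example 8.4: y'' = -(1 + (y')^2) / y, y(1) = 1, y(2) = 2
f = lambda x, y, yp: -(1.0 + yp**2) / y
f_y = lambda x, y, yp: (1.0 + yp**2) / y**2
f_yp = lambda x, y, yp: -2.0 * yp / y

def max_error(N):
    x, w = newton_fd_dirichlet(f, f_y, f_yp, 1.0, 2.0, 1.0, 2.0, N)
    return np.max(np.abs(w - np.sqrt(6.0 * x - 4.0 - x**2)))

err4, err8 = max_error(4), max_error(8)
print(err8, err4 / err8)
```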


Neumann and Robin Boundary Conditions 


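To modify our finite difference method to handle non-Dirichlet boundary conditions, we will have to derive a new function $g_0(\mathbf{w})$ and/or a new function $g_N(\mathbf{w})$, and, accordingly, determine a new first row and/or a new last row for the Jacobian. To maintain the second-order accuracy of our scheme and to maintain the tridiagonal structure of the Jacobian matrix, we will once again make use of fictitious nodes. Furthermore, since a Neumann boundary condition is just a special case of a Robin boundary condition, we will focus our attention on the equations corresponding to Robin conditions.

Let's start at $x = x_0$. Applying the computational template given by equation (2) with $i = 0$ yields

\[ -w_{-1} + 2w_0 - w_1 + h^2 f\left(x_0, w_0, \frac{w_1 - w_{-1}}{2h}\right) = 0. \qquad (3) \]

We must, of course, eliminate $w_{-1}$ from this equation. The Robin boundary condition $\alpha_1 y(a) + \alpha_2 y'(a) = \alpha_3$ leads to the finite difference equation

\[ \alpha_1 w_0 + \alpha_2\frac{w_1 - w_{-1}}{2h} = \alpha_3. \]

An intermediate result arrived at during the solution of this equation for $w_{-1}$ is

\[ \frac{w_1 - w_{-1}}{2h} = \frac{\alpha_3 - \alpha_1 w_0}{\alpha_2}, \qquad (4) \]

which can be recognized as the third argument to the function $f$ in equation (3). Completing the solution for $w_{-1}$ yields

\[ w_{-1} = w_1 - 2h\frac{\alpha_3}{\alpha_2} + 2h\frac{\alpha_1}{\alpha_2}w_0. \qquad (5) \]

Substituting equations (4) and (5) into equation (3) gives

\[ g_0(\mathbf{w}) = 2\left(1 - h\frac{\alpha_1}{\alpha_2}\right)w_0 - 2w_1 + h^2 f\left(x_0, w_0, \frac{\alpha_3 - \alpha_1 w_0}{\alpha_2}\right) + 2h\frac{\alpha_3}{\alpha_2}. \qquad (6) \]

For a Neumann boundary condition at $x = x_0$, $\alpha_1 = 0$ and equation (6) reduces to

\[ g_0(\mathbf{w}) = 2w_0 - 2w_1 + h^2 f(x_0, w_0, \alpha) + 2h\alpha, \qquad (7) \]

where we have written $\alpha$ for the ratio $\alpha_3/\alpha_2$.

Proceeding in an analogous manner, the Robin boundary condition $\beta_1 y(b) + \beta_2 y'(b) = \beta_3$ leads to

\[ g_N(\mathbf{w}) = -2w_{N-1} + 2\left(1 + h\frac{\beta_1}{\beta_2}\right)w_N + h^2 f\left(x_N, w_N, \frac{\beta_3 - \beta_1 w_N}{\beta_2}\right) - 2h\frac{\beta_3}{\beta_2}, \qquad (8) \]

while the Neumann boundary condition $y'(b) = \beta$ leads to

\[ g_N(\mathbf{w}) = -2w_{N-1} + 2w_N + h^2 f(x_N, w_N, \beta) - 2h\beta. \qquad (9) \]

The derivation of equation (8) is left as an exercise.

Having determined the functions $g_0(\mathbf{w})$ and $g_N(\mathbf{w})$ for Robin and Neumann boundary conditions, we can now compute the corresponding entries along the first and last rows of the Jacobian matrix. By design, the only nonzero entries along the first row are

\[ J_{1,1} = \frac{\partial g_0}{\partial w_0} \qquad \text{and} \qquad J_{1,2} = \frac{\partial g_0}{\partial w_1}, \]

and the only nonzero entries along the last row are

\[ J_{N+1,N} = \frac{\partial g_N}{\partial w_{N-1}} \qquad \text{and} \qquad J_{N+1,N+1} = \frac{\partial g_N}{\partial w_N}. \]

From equations (6) and (7)

\[ J_{1,1} = \begin{cases}
2\left(1 - h\dfrac{\alpha_1}{\alpha_2}\right) + h^2\dfrac{\partial f}{\partial y}(x_0, w_0, \hat{\alpha}) - h^2\dfrac{\alpha_1}{\alpha_2}\dfrac{\partial f}{\partial y'}(x_0, w_0, \hat{\alpha}), & \text{Robin} \\
2 + h^2\dfrac{\partial f}{\partial y}(x_0, w_0, \alpha), & \text{Neumann,}
\end{cases} \]

where $\hat{\alpha} = (\alpha_3 - \alpha_1 w_0)/\alpha_2$, and $J_{1,2} = -2$. From equations (8) and (9)

\[ J_{N+1,N+1} = \begin{cases}
2\left(1 + h\dfrac{\beta_1}{\beta_2}\right) + h^2\dfrac{\partial f}{\partial y}(x_N, w_N, \hat{\beta}) - h^2\dfrac{\beta_1}{\beta_2}\dfrac{\partial f}{\partial y'}(x_N, w_N, \hat{\beta}), & \text{Robin} \\
2 + h^2\dfrac{\partial f}{\partial y}(x_N, w_N, \beta), & \text{Neumann,}
\end{cases} \]

where $\hat{\beta} = (\beta_3 - \beta_1 w_N)/\beta_2$, and $J_{N+1,N} = -2$.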
The selection of the initial Newton iterate, $\mathbf{w}^{(0)}$, is not as straightforward for non-Dirichlet boundary conditions. In most cases, we can still use the boundary conditions to determine a linear function that can then be evaluated at the grid points to produce an initial vector. For instance, the boundary conditions

\[ y'(a) = \alpha \qquad \text{and} \qquad \beta_1 y(b) + \beta_2 y'(b) = \beta_3 \]

determine the linear function

\[ \alpha x + \left(\frac{\beta_3 - \beta_2\alpha}{\beta_1} - \alpha b\right). \]

There are situations, however, when a linear fit to the boundary conditions is not possible, such as when Neumann conditions are specified at both endpoints, with different values for the derivative. After all, a linear function can have only one slope. Furthermore, even when a linear function can be fit to the boundary conditions, the resulting function may not produce an appropriate initial vector. Consider the boundary value problem

\[ y'' = -\left(1 + (y')^2\right)/y \]
\[ y(1) - y'(1) = -1, \qquad y'(2) = 1/2. \]

The function $(1/2)x - 1$ satisfies the boundary conditions, but unfortunately evaluates to zero at $x = 2$, which generates division by zero in the differential equation. In these cases (the linear fit fails or does not provide an appropriate initial vector), it may be best to just use an arbitrary constant vector as $\mathbf{w}^{(0)}$. The bottom line is that for nonlinear boundary value problems with non-Dirichlet boundary conditions, some trial and error may be necessary to find a good starting vector for Newton's method.


EXAMPLE 8.5 A Problem with One Robin and One Neumann Boundary Condition


Consider the nonlinear boundary value problem

\[ y'' - 2y^3 = 0, \qquad x \in [0, 1] \]
\[ 3y(0) - 9y'(0) = 2, \qquad y'(1) = -1/16. \]

The exact solution for this problem is $y(x) = 1/(x + 3)$.

Using a uniform partition with $N = 8$ subintervals, the results shown in the first table were obtained. The initial vector for Newton's method was taken as

\[ w_i^{(0)} = \frac{23}{48} - \frac{x_i}{16}. \]

These values were determined by evaluating the linear function that satisfies the boundary conditions at each grid point. A convergence tolerance of $TOL = 5 \times 10^{-14}$ was used to terminate the iteration.

The second table demonstrates the second-order accuracy of the numerical method. Each time the number of subintervals in the uniform partition was doubled, the maximum absolute error and the rms error were reduced roughly by a factor of four. For each value of $N$, the initial vector and convergence tolerance cited above were used.

x_i     Approximate Solution, w_i   Exact Solution, y_i   Absolute Error
0.000   0.333224                    0.333333              0.000109
0.125   0.319909                    0.320000              0.000091
0.250   0.307617                    0.307692              0.000076
0.375   0.296234                    0.296296              0.000062
0.500   0.285664                    0.285714              0.000050
0.625   0.275822                    0.275862              0.000040
0.750   0.266636                    0.266667              0.000030
0.875   0.258043                    0.258065              0.000022
1.000   0.249986                    0.250000              0.000014

N     Maximum Absolute Error   Error Ratio   rms Error      Error Ratio
4     0.0004333909                           0.0002590402
8     0.0001093182             3.964488      0.0000629312   4.116244
16    0.0000273913             3.990991      0.0000154601   4.070552
32    0.0000068517             3.997739      0.0000038283   4.038336
64    0.0000017132             3.999434      0.0000009523   4.019922
128   0.0000004283             3.999859      0.0000002375   4.010148
256   0.0000001071             3.999965      0.0000000593   4.005121
512   0.0000000268             3.999989      0.0000000148   4.002569
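A sketch of the computation for this example follows, using equation (6) for the Robin row at $x = 0$ ($\alpha_1 = 3$, $\alpha_2 = -9$, $\alpha_3 = 2$) and equation (9) for the Neumann row at $x = 1$ ($\beta = -1/16$). Because $f = 2y^3$ does not depend on $y'$, the $\partial f/\partial y'$ terms drop out. The function name, stopping test, and iteration cap are assumptions.

```python
import numpy as np

def solve_example_85(N, tol=5e-14, max_iter=50):
    """Newton's method on the finite difference equations for
    y'' = 2 y^3, 3 y(0) - 9 y'(0) = 2, y'(1) = -1/16."""
    h = 1.0 / N
    x = h * np.arange(N + 1)
    a1, a2, a3 = 3.0, -9.0, 2.0
    beta = -1.0 / 16.0
    f = lambda y: 2.0 * y**3
    f_y = lambda y: 6.0 * y**2
    w = 23.0 / 48.0 - x / 16.0              # linear fit to the boundary conditions
    idx = np.arange(1, N)
    for _ in range(max_iter):
        G = np.zeros(N + 1)
        J = np.zeros((N + 1, N + 1))
        # Robin row, equation (6)
        G[0] = (2 * (1 - h * a1 / a2) * w[0] - 2 * w[1]
                + h**2 * f(w[0]) + 2 * h * a3 / a2)
        J[0, 0] = 2 * (1 - h * a1 / a2) + h**2 * f_y(w[0])
        J[0, 1] = -2.0
        # Interior rows, equation (2)
        G[idx] = -w[idx - 1] + 2 * w[idx] - w[idx + 1] + h**2 * f(w[idx])
        J[idx, idx - 1] = -1.0
        J[idx, idx] = 2.0 + h**2 * f_y(w[idx])
        J[idx, idx + 1] = -1.0
        # Neumann row, equation (9)
        G[N] = -2 * w[N - 1] + 2 * w[N] + h**2 * f(w[N]) - 2 * h * beta
        J[N, N - 1] = -2.0
        J[N, N] = 2.0 + h**2 * f_y(w[N])
        v = np.linalg.solve(J, -G)
        w = w + v
        if np.max(np.abs(v)) < tol:
            break
    return x, w

def max_error(N):
    x, w = solve_example_85(N)
    return np.max(np.abs(w - 1.0 / (x + 3.0)))

err4, err8 = max_error(4), max_error(8)
print(err8, err4 / err8)
```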


Application Problem: Spatial Distribution of an Insect Population


In the Overview to this chapter (see page 658), we showed that under certain assumptions (one-dimensional diffusion, Fick's law, logistic growth, etc.) the nondimensional steady-state population density, $n(x)$, of an insect population that has been released into a cylindrical environment satisfies the boundary value problem

\[ n'' + n(1 - n) = 0, \qquad n(0) = 0, \quad n'(l) = 0. \qquad (10) \]

Here, primes denote differentiation with respect to $x$, $x$ measures nondimensional distance along the axis of the cylinder, and $l$ is the nondimensional length of the cylinder. Clearly, the so-called trivial solution, $n(x) = 0$, is a solution of (10). It can be shown (see, for example, Ludwig, Aronson and Weinberger [2]) that the trivial solution is the only solution to (10) whenever $l < \pi/2$; but, when $l > \pi/2$, there exists a unique nonnegative nontrivial solution.

The nonnegative nontrivial solution corresponding to $l = 3$ is displayed in the top graph of Figure 8.10. This solution was determined using a uniform partition of $[0, 3]$ with $N = 300$ subintervals. The initial vector for Newton's method was

\[ w_i^{(0)} = 1 - \frac{(x_i - 3)^2}{9}, \]

which was obtained by evaluating, at each of the grid points, the quadratic polynomial satisfying both boundary conditions and taking the value 1 at $x = 3$. (Why would the linear function that satisfied the boundary conditions not have provided an appropriate choice for the initial vector?) A convergence tolerance of $5 \times 10^{-10}$ was used to terminate iteration.

Figure 8.10 Numerical results for the "Spatial Distribution of an Insect Population" problem. (Top graph) Nonzero steady-state population density distribution for $l = 3$. (Bottom graph) Maximum population density as a function of cylinder length, $l$.

The bottom graph in Figure 8.10 displays the maximum value of the population density as a function of the cylinder length, $l$. For $l < \pi/2$, we know that the density is zero everywhere; hence, the maximum density is zero. The remainder of the graph was obtained by solving (10) for values of $l$ ranging from 1.6 to 10.0 in increments of 0.1 and recording the maximum value of the resulting density profile. For each $l$, calculations were performed with

\[ N = 100, \qquad w_i^{(0)} = 1 - \frac{(x_i - l)^2}{l^2}, \]

and a convergence tolerance of $5 \times 10^{-10}$. Note that for $l > 5.8$ the maximum density is within one percent of the carrying capacity.
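The calculation behind Figure 8.10 can be sketched as follows, with $f(x, n, n') = -n(1 - n)$, a Dirichlet row at $x = 0$, and the Neumann row from equation (9) with $\beta = 0$ at $x = l$. The function name, the stopping test, and the use of a dense solver are assumptions made for illustration.

```python
import numpy as np

def steady_density(ell, N=100, tol=5e-10, max_iter=50):
    """Newton's method on the finite difference equations for
    n'' + n(1 - n) = 0, n(0) = 0, n'(ell) = 0."""
    h = ell / N
    x = h * np.arange(N + 1)
    w = 1.0 - (x - ell)**2 / ell**2          # quadratic initial vector from the text
    idx = np.arange(1, N)
    for _ in range(max_iter):
        fw = -w * (1.0 - w)                  # f(x, n, n') = -n(1 - n)
        fy = 2.0 * w - 1.0                   # partial derivative of f with respect to n
        G = np.zeros(N + 1)
        J = np.zeros((N + 1, N + 1))
        G[0] = w[0]                          # Dirichlet row: n(0) = 0
        J[0, 0] = 1.0
        G[idx] = -w[idx - 1] + 2 * w[idx] - w[idx + 1] + h**2 * fw[idx]
        J[idx, idx - 1] = -1.0
        J[idx, idx] = 2.0 + h**2 * fy[idx]
        J[idx, idx + 1] = -1.0
        G[N] = -2 * w[N - 1] + 2 * w[N] + h**2 * fw[N]   # Neumann row, beta = 0
        J[N, N - 1] = -2.0
        J[N, N] = 2.0 + h**2 * fy[N]
        v = np.linalg.solve(J, -G)
        w = w + v
        if np.max(np.abs(v)) < tol:
            break
    return x, w

x3, n3 = steady_density(3.0)
x7, n7 = steady_density(7.0)
print(n3.max(), n7.max())
```

Sweeping `ell` over a range of values and recording `n.max()` gives a curve of the kind described for the bottom graph.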
EXERCISES

1. Determine the conditions under which a linear function cannot be fit to the Dirichlet boundary condition $y(a) = \alpha$ at the left endpoint and the Robin boundary condition $\beta_1 y(b) + \beta_2 y'(b) = \beta_3$ at the right endpoint.

2. Derive Equation (8), i.e., the function $g_N(\mathbf{w})$ corresponding to a Robin boundary condition at the right endpoint of the problem domain.

3. If the function $f(x, y, y')$ is of the form

\[ f(x, y, y') = p(x)y' + q(x)y + r(x) \]

for some functions $p$, $q$ and $r$, show that the formulas for $d_i$, $u_i$, and $l_i$ reduce to formulas (4), (5), and (6) from Section 8.1.

For the boundary value problems in Exercises 4-9,

(a) identify the function $f(x, y, y')$ and compute the partial derivatives $\partial f/\partial y$ and $\partial f/\partial y'$;
(b) for $N = 4$, write out the system of finite difference equations.


4. 


Cm N DH 


y’+GYP=1, y0)=1, yl) =2 

ty ty’ =0, y0)=0, y+) =1 

y! = —2y? — 4eyy', y(0}=1, yfl) = 1/2 

y+ 4yy' = -2y/(L +27), yO) ty) =1, y() =0 
ay’ +2y? —&e7y? =0, y(O)=1, (1) =-1/2 

py =by’, yd)-y()=0, yl) ty) =3 


In Exercises 10-16, approximate the solution of the indicated boundary value prob- 
lem using the finite difference method and confirm the second-order accuracy of the 
numerical method. 


10. 
11. 
12, 
13. 
14. 
15. 
16. 
1, 


yw =—-G4G' ey, yy) =-1L y'(2)=1/2, lz) = Vor - 4-2? 
yf! = —2y? + 827y9,  y(0)=1, yl) =1/2, ye) = 1/1 +27) 

y! +4yy' = -2y/(1+2"), yO) =0, y(0)=1/2, we) =2/{1+2°) 

yt =1, y¥(0)=0, y(1)=2 

y+yy ty’ =0, y0)=0, yQ)+y(1)=1, yz) =1-e™* 

y” +4eyy = —-2y7, yO)tv(@=1, y)+y(1)=0, (2) =1/1 +27) 
yy’ =y'-1, y=0, y(2)=14+n2 

Consider the nonlinear differential equation

\[ 2xy'' + (y')^2 - 4y = 4x. \]

Use the finite difference method to solve this differential equation subject to each
of the following sets of boundary conditions. In each case, the exact solution is
$y(x) = (x + 1)^2$. How rapidly does the approximate solution converge toward
the exact solution as a function of the number of subintervals? Provide an
explanation for your observation.
(a) $y(1) = 4$, $y(2) = 9$    (b) y(1) =
(c) $y'(1) = 4$, $y'(2) = 6$    (d) y(1

Ramirez (Computational Methods for Process Simulation, Butterworths, Boston,
1989) develops the following nonlinear boundary value problem for the
concentration, $y(x)$, of the reactant in a second-order chemical reaction taking place
within a tubular reactor with dispersion:


=, EO as 2 — 
i 105 Sy 
1 dy dy 
Oye ee. aU . 
an 10 dar «=0 7 da: gal : 


Determine the concentration in increments of 0.01 along the length of the reactor. 
C. Philipsen, S. Markvorsen, and W. Kleim [“Modelling the Stem Curve of a
Palm in a Strong Wind,” SIAM Review, 38 (3), pp. 483-484, 1996] present
the following model for the angle of the stem of a tall palm tree, relative to its
vertical position, when the tree is subjected to wind loading:

\[ EI \frac{d^2\theta}{ds^2} = -W_s \left( 1 - \frac{s}{L} \right) \sin\theta - W_c \sin\theta - D \cos\theta. \]

Here, $\theta$ is the angle of the stem relative to the vertical position, and $s$ is arc
length measured along the stem. The parameters in the model are the total
stem weight $W_s = 22700$ N, the Young’s modulus of the stem $E = 0.9 \times 10^9$
N/m$^2$, the moment of inertia of the stem $I = 5.147 \times 10^{-4}$ m$^4$, the length of
the stem $L = 30$ m, the total canopy weight $W_c = 1385.5$ N, and the wind drag
force on the canopy $D = 1.2405U^2$ N, where $U$ is the wind speed in m/s. The
boundary conditions imposed on the stem angle are

\[ \theta(0) = 0 \quad \text{and} \quad \theta'(L) = 0. \]

Determine the function $\theta(s)$ when the wind speed is 8 meters/second.
Subramanian and Balakotaiah [“Convective Instabilities Induced by Exothermic
Reactions Occurring in a Porous Medium,” Phys. Fluids, 6 (9), pp. 2907-2922,
1994] develop the boundary value problem

\[ \frac{d^2\theta}{dz^2} + B\phi^2 \left( 1 - \frac{\theta}{B} \right) \exp\left( \frac{\gamma\theta}{\gamma + \theta} \right) = 0 \]

\[ \theta'(0) = 0, \quad \theta(1) = 0 \]

for the steady-state temperature profile, $\theta(z)$, in a porous medium undergoing
an exothermic reaction. The parameter $B$ is the maximum possible temperature
in the absence of natural convection, $\phi^2$ is the ratio of the characteristic time
for conduction to that for heat generation and $\gamma$ is the dimensionless activation
energy. For $B = 6.0$, $\phi^2 = 0.25$, and $\gamma = 30.0$, determine $\theta(z)$.

In the “Spatial Distribution of an Insect Population” problem, suppose that
rather than a barrier at $x = l$, a steady influx of insects is maintained. This
changes the boundary condition from $n'(l) = 0$ to $n'(l) = j$, where $j$ is a
nondimensional flux parameter. For $l = 3$, determine the density distribution for
values of $j$ ranging from 0.05 to 0.25 in increments of 0.05. Approximate the
value of $j$ for which the maximum density is equal to the carrying capacity.

8.4 THE SHOOTING METHOD, PART I: LINEAR BOUNDARY VALUE PROBLEMS


The shooting method is an alternative numerical method for solving boundary 
value problems. The basic idea is to convert the boundary value problem into two 
or more initial value problems that can be solved using the techniques developed in 
Chapter 7. For linear boundary value problems, it is a simple matter to combine 
the solutions of the initial value problems to generate the solution to the original 
boundary value problem. 


Dirichlet Boundary Conditions 


Let’s start simple and demonstrate the technique for a boundary value problem
with Dirichlet boundary conditions:

\[ y'' = p(x)y' + q(x)y + r(x) \]
\[ y(a) = \alpha, \quad y(b) = \beta. \]

The basis for the shooting method is as follows. The above boundary value problem
is almost an initial value problem. We have the differential equation and the value
of the solution at $x = a$. The only piece of information that is missing is the value
of the first derivative at $x = a$. Why not then guess a value for $y'(a)$, use any
available initial value problem solver to march the solution out to $x = b$ and check
whether the boundary condition at $x = b$ has been satisfied? If it has, then we have
found the solution to the boundary value problem; if the boundary condition at
$x = b$ has not been satisfied, then we make a “better” guess for the value of $y'(a)$
and repeat the process. This approach essentially transforms the boundary value
problem into a rootfinding problem. When we work with nonlinear problems in the
next section, we will use precisely this approach.

For linear boundary value problems, however, a slightly different plan of
attack will produce the approximate solution in a much more direct manner. The key
observation is that every solution to a linear, nonhomogeneous differential
equation can be written as a particular solution plus a constant times a solution to the
corresponding homogeneous problem. This suggests working with not one initial
value problem, but two. The first of these has the original nonhomogeneous
differential equation, with the function value at $x = a$ given by the boundary condition
$y(a) = \alpha$ and with an arbitrary value specified for the first derivative. The solution
of this problem is not expected to match the boundary condition at $x = b$.
Therefore, the second initial value problem, which has the corresponding homogeneous
differential equation subject to $y(a) = 0$ and an arbitrary, nonzero value for the
first derivative, is also solved. Multiplying the solution of this second initial value
problem by an appropriate constant and adding the result to the solution of the
first initial value problem will allow the boundary condition at $x = b$ to be satisfied.

Let’s examine this computational scheme in detail. Consider the two initial 
value problems 

\[ \text{IVP1:} \quad y'' = p(x)y' + q(x)y + r(x), \qquad y(a) = \alpha, \quad y'(a) = 0 \]

\[ \text{IVP2:} \quad y'' = p(x)y' + q(x)y, \qquad y(a) = 0, \quad y'(a) = 1 \]

The initial values shown for the first derivative are the standard choices. Let $y_1(x)$
denote the solution of IVP1 and $y_2(x)$ denote the solution of IVP2. Due to the
linearity of the differential equation, it follows that

\[ y(x) = y_1(x) + c\,y_2(x) \tag{1} \]

is a solution of $y'' = p(x)y' + q(x)y + r(x)$ for any value of the constant $c$.
Furthermore,

\[ y(a) = y_1(a) + c\,y_2(a) = \alpha + c \cdot 0 = \alpha. \]

At $x = b$ we have $y(b) = y_1(b) + c\,y_2(b)$. Equating this value to $\beta$ and solving for $c$,
we find that with

\[ c = \frac{\beta - y_1(b)}{y_2(b)} \tag{2} \]

the function $y(x) = y_1(x) + c\,y_2(x)$ will satisfy the original boundary value problem.
To summarize, the shooting method for approximating the solution of the
linear boundary value problem with Dirichlet boundary conditions,

\[ y'' = p(x)y' + q(x)y + r(x) \]
\[ y(a) = \alpha, \quad y(b) = \beta, \]

consists of three steps. First, solve the initial value problems IVP1 and IVP2 using
any initial value problem solver. It is not required that both problems be solved
with the same numerical method. Second, using the values of $y_1(b)$ and $y_2(b)$
obtained in the first step and the value of $\beta$ given in the boundary condition at
$x = b$, compute $c$ from equation (2). Finally, pointwise combine the solutions $y_1(x)$
and $y_2(x)$ according to equation (1).
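The three steps above can be sketched in a few lines of Python. This is only a minimal illustration, not a reference implementation: the helper names `rk4` and `linear_shooting_dirichlet` are hypothetical, both initial value problems are solved with the classical RK4 method on the same uniform grid, and the test problem $-u'' + \pi^2 u = 2\pi^2 \sin(\pi x)$, $u(0) = u(1) = 0$ (exact solution $\sin(\pi x)$) is chosen simply because its solution is known.

```python
import math

def rk4(f, x0, y0, h, n):
    """Classical fourth-order Runge-Kutta for y' = f(x, y); returns the n+1 states."""
    states, y = [list(y0)], list(y0)
    for i in range(n):
        x = x0 + i * h
        k1 = f(x, y)
        k2 = f(x + h / 2, [yj + h / 2 * kj for yj, kj in zip(y, k1)])
        k3 = f(x + h / 2, [yj + h / 2 * kj for yj, kj in zip(y, k2)])
        k4 = f(x + h, [yj + h * kj for yj, kj in zip(y, k3)])
        y = [yj + h / 6 * (s1 + 2 * s2 + 2 * s3 + s4)
             for yj, s1, s2, s3, s4 in zip(y, k1, k2, k3, k4)]
        states.append(y)
    return states

def linear_shooting_dirichlet(p, q, r, a, b, alpha, beta, n):
    """Solve y'' = p y' + q y + r, y(a) = alpha, y(b) = beta by linear shooting."""
    h = (b - a) / n
    # Step 1: IVP1 (nonhomogeneous, y(a) = alpha, y'(a) = 0) and
    #         IVP2 (homogeneous,    y(a) = 0,     y'(a) = 1), each as a first-order system.
    y1 = rk4(lambda x, y: [y[1], p(x) * y[1] + q(x) * y[0] + r(x)], a, [alpha, 0.0], h, n)
    y2 = rk4(lambda x, y: [y[1], p(x) * y[1] + q(x) * y[0]], a, [0.0, 1.0], h, n)
    # Step 2: shooting parameter from equation (2).
    c = (beta - y1[-1][0]) / y2[-1][0]
    # Step 3: pointwise combination from equation (1).
    xs = [a + i * h for i in range(n + 1)]
    w = [u[0] + c * v[0] for u, v in zip(y1, y2)]
    return xs, w, c

# Test problem: u'' = pi^2 u - 2 pi^2 sin(pi x), u(0) = u(1) = 0, with N = 4 steps.
xs, w, c = linear_shooting_dirichlet(
    p=lambda x: 0.0,
    q=lambda x: math.pi ** 2,
    r=lambda x: -2.0 * math.pi ** 2 * math.sin(math.pi * x),
    a=0.0, b=1.0, alpha=0.0, beta=0.0, n=4)

for xi, wi in zip(xs, w):
    print(f"{xi:4.2f}  {wi: .6f}  {math.sin(math.pi * xi): .6f}")
```

Because the combination in step 3 is linear, the boundary condition at $x = b$ is met to rounding error regardless of how coarse the grid is; the grid only affects the accuracy at interior points.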


EXAMPLE 8.6 Demonstration of the Shooting Method for a Linear 
Boundary Value Problem 
Consider the linear boundary value problem with Dirichlet boundary conditions

\[ -u'' + \pi^2 u = 2\pi^2 \sin(\pi x) \]
\[ u(0) = u(1) = 0. \]

To approximate the solution of this problem using the shooting method, we first
convert the boundary value problem into the two initial value problems

\[ \text{IVP1:} \quad u'' = \pi^2 u - 2\pi^2 \sin(\pi x), \qquad u(0) = 0, \quad u'(0) = 0 \]

and

\[ \text{IVP2:} \quad u'' = \pi^2 u, \qquad u(0) = 0, \quad u'(0) = 1. \]

Let $u_1$ denote the approximate solution of IVP1 and $u_2$ the approximate solution
of IVP2. Regardless of which initial value problem solver we choose, each initial
value problem must be converted to a system. The systems corresponding to IVP1
and IVP2 are, respectively,

\[ u_{1,1}' = u_{1,2}, \qquad u_{1,2}' = \pi^2 u_{1,1} - 2\pi^2 \sin(\pi x) \]

and

\[ u_{2,1}' = u_{2,2}, \qquad u_{2,2}' = \pi^2 u_{2,1}. \]

Using the classical fourth-order Runge-Kutta method (RK4) with $N = 4$ steps to
march from $x = 0$ to $x = 1$, the results given below are obtained.

    x_i       u_1(x_i)      u_2(x_i)
    0.00      0.000000      0.000000
    0.25     -0.157372      0.275702
    0.50     -1.290357      0.730213
    0.75     -4.490694      1.657343
    1.00    -11.466375      3.656793

Note that neither solution satisfies the boundary condition at $x = 1$.
Now, using equation (2), we compute

\[ c = \frac{0 - (-11.466375)}{3.656793} = 3.135637. \]

The function $w(x) = u_1(x) + 3.135637\,u_2(x)$ is then guaranteed to satisfy both
boundary conditions. The value of $w$ at each $x_i$, $w_i$, is given in the second column
of the following table and is compared to the corresponding value of the exact
solution, $u(x) = \sin(\pi x)$.

              Approximate      Exact
    x_i       Solution, w_i    Solution, u_i    Absolute Error
    0.00      0.000000         0.000000
    0.25      0.707129         0.707107         0.000022
    0.50      0.999327         1.000000         0.000673
    0.75      0.706132         0.707107         0.000975
    1.00      0.000000         0.000000

The accuracy of this solution is excellent, especially considering the crudeness of the
computational grid. Since we have used RK4 to obtain our approximate solution, we
expect fourth-order convergence toward the exact solution. Numerical verification
of the order of convergence is left as an exercise.

Other Types of Boundary Conditions


The shooting method for linear boundary value problems that has just been
described can also be applied when boundary conditions other than Dirichlet are
specified. Suppose, for example, that we have the same Dirichlet condition at
$x = a$, $y(a) = \alpha$, but we replace the condition at $x = b$ with the Robin condition
$\beta_1 y(b) + \beta_2 y'(b) = \beta_3$. Since the condition at $x = a$ has not changed, the basic
structure of the problem remains the same: The initial value of the function is
known, but the initial value for the first derivative must be “guessed.” Therefore,
IVP1 and IVP2 need not change, and the solutions of these two problems will
still be combined according to equation (1). Only the equation for computing the
constant $c$ must change.
From equation (1), at $x = b$ we have

\[ y(b) = y_1(b) + c\,y_2(b) \]

and

\[ y'(b) = y_1'(b) + c\,y_2'(b). \]

Therefore, to satisfy the boundary condition at $x = b$, $c$ must be selected to satisfy

\[ \beta_1 [y_1(b) + c\,y_2(b)] + \beta_2 [y_1'(b) + c\,y_2'(b)] = \beta_3. \]

Solving for $c$, we obtain

\[ c = \frac{\beta_3 - \beta_1 y_1(b) - \beta_2 y_1'(b)}{\beta_1 y_2(b) + \beta_2 y_2'(b)}. \tag{3} \]

For the Neumann condition $y'(b) = \beta$ at $x = b$, the equation for $c$ becomes

\[ c = \frac{\beta - y_1'(b)}{y_2'(b)}. \tag{4} \]
Next, suppose we have the Neumann condition $y'(a) = \alpha$ at $x = a$. This
changes the basic nature of the problem. Instead of knowing the initial function
value and needing to guess the initial value for the derivative, now we know the
initial value for the derivative and need to guess the initial function value. This
suggests converting the boundary value problem into the two initial value problems:

\[ \text{IVP1:} \quad y'' = p(x)y' + q(x)y + r(x), \qquad y(a) = 0, \quad y'(a) = \alpha \]

and

\[ \text{IVP2:} \quad y'' = p(x)y' + q(x)y, \qquad y(a) = 1, \quad y'(a) = 0. \]

Once again, let $y_1(x)$ denote the solution of IVP1 and $y_2(x)$ denote the solution of
IVP2, and let $y(x) = y_1(x) + c\,y_2(x)$ for some constant $c$. The value of $c$ is given by
equation (2), (3) or (4), depending on whether the boundary condition at $x = b$ is
a Dirichlet condition, a Robin condition, or a Neumann condition, respectively.

Finally, suppose that the Robin condition $\alpha_1 y(a) + \alpha_2 y'(a) = \alpha_3$ is specified
at $x = a$. Now we know neither the initial function value nor the initial value for the
first derivative, so both will have to be guessed. This requires that the boundary
value problem be replaced by three initial value problems:

\[ \text{IVP1:} \quad y'' = p(x)y' + q(x)y + r(x), \qquad y(a) = 0, \quad y'(a) = 0, \]

\[ \text{IVP2:} \quad y'' = p(x)y' + q(x)y, \qquad y(a) = 1, \quad y'(a) = 0, \]

and

\[ \text{IVP3:} \quad y'' = p(x)y' + q(x)y, \qquad y(a) = 0, \quad y'(a) = 1. \]

The solutions to these problems will be combined according to the rule

\[ y(x) = y_1(x) + c_1 y_2(x) + c_2 y_3(x), \tag{5} \]

where $y_1(x)$, $y_2(x)$, and $y_3(x)$ are the solutions to IVP1, IVP2, and IVP3,
respectively. To determine the appropriate values for the constants $c_1$ and $c_2$, we have
to satisfy both boundary conditions. From the boundary condition at $x = a$ and
equation (5), we obtain the equation

\[ [\alpha_1 y_2(a) + \alpha_2 y_2'(a)] c_1 + [\alpha_1 y_3(a) + \alpha_2 y_3'(a)] c_2 = \alpha_3 - \alpha_1 y_1(a) - \alpha_2 y_1'(a), \]

which simplifies to

\[ \alpha_1 c_1 + \alpha_2 c_2 = \alpha_3 \tag{6} \]

upon substituting the initial values from IVP1, IVP2, and IVP3. The second
equation for the constants $c_1$ and $c_2$ depends on the type of boundary condition
specified at $x = b$. The three possibilities are

\[ y_2(b) c_1 + y_3(b) c_2 = \beta - y_1(b) \tag{7} \]
\[ y_2'(b) c_1 + y_3'(b) c_2 = \beta - y_1'(b) \tag{8} \]
\[ [\beta_1 y_2(b) + \beta_2 y_2'(b)] c_1 + [\beta_1 y_3(b) + \beta_2 y_3'(b)] c_2 = \beta_3 - \beta_1 y_1(b) - \beta_2 y_1'(b) \tag{9} \]

for a Dirichlet condition, a Neumann condition, and a Robin condition, respectively.
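Once the three initial value problems have been solved, finding $c_1$ and $c_2$ is just a 2-by-2 linear solve: equation (6) supplies the first row and one of equations (7), (8), or (9) supplies the second. The sketch below is a hedged illustration using Cramer's rule; the function name `solve_shooting_constants` is hypothetical, and the sample numbers are the endpoint derivative values reported in Example 8.8 (Robin condition at the left endpoint with $\alpha_1 = \alpha_2 = 1$, $\alpha_3 = -1$, and Neumann condition $u'(\pi/2) = 1$ at the right endpoint).

```python
def solve_shooting_constants(a11, a12, b1, a21, a22, b2):
    """Solve [[a11, a12], [a21, a22]] [c1, c2]^T = [b1, b2]^T by Cramer's rule."""
    det = a11 * a22 - a12 * a21
    if det == 0.0:
        raise ValueError("the two boundary equations are linearly dependent")
    c1 = (b1 * a22 - a12 * b2) / det
    c2 = (a11 * b2 - a21 * b1) / det
    return c1, c2

# Row 1, equation (6): alpha1*c1 + alpha2*c2 = alpha3.
alpha1, alpha2, alpha3 = 1.0, 1.0, -1.0

# Row 2, equation (8) for a Neumann condition y'(b) = beta:
#   y2'(b)*c1 + y3'(b)*c2 = beta - y1'(b).
# Sample endpoint derivatives (RK4 with N = 4, as quoted in Example 8.8).
dy1_b = -0.00009934793668
dy2_b = -0.99990005047118
dy3_b = -0.00029430281825
beta = 1.0

c1, c2 = solve_shooting_constants(alpha1, alpha2, alpha3,
                                  dy2_b, dy3_b, beta - dy1_b)
print(c1, c2)
```

For a Dirichlet or Robin condition at $x = b$, only the second row changes: its coefficients and right-hand side come from equation (7) or (9) instead.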


EXAMPLE 8.7 A Problem with One Neumann and One Robin Boundary 
Condition 


Consider the ordinary differential equation

\[ u'' + u = \sin(3x), \qquad 0 < x < \pi/2 \]

subject to a Neumann boundary condition at $x = 0$:

\[ u'(0) = 1 \]

and a Robin boundary condition at $x = \pi/2$:

\[ u(\pi/2) + u'(\pi/2) = -1. \]

With the Neumann boundary condition at $x = 0$, we first convert the boundary
value problem into the two initial value problems

\[ \text{IVP1:} \quad u'' = -u + \sin(3x), \qquad u(0) = 0, \quad u'(0) = 1 \]

and

\[ \text{IVP2:} \quad u'' = -u, \qquad u(0) = 1, \quad u'(0) = 0. \]

Let $u_1$ denote the approximate solution of IVP1 and $u_2$ the approximate solution
of IVP2. The systems corresponding to IVP1 and IVP2 are, respectively,

\[ u_{1,1}' = u_{1,2}, \qquad u_{1,2}' = -u_{1,1} + \sin(3x) \]

and

\[ u_{2,1}' = u_{2,2}, \qquad u_{2,2}' = -u_{2,1}. \]

The RK4 method, with $N = 4$ steps to march from $x = 0$ to $x = \pi/2$, produces the
results

    x_i       u_1(x_i)    u_1'(x_i)    u_2(x_i)    u_2'(x_i)
    0         0.000000    1.000000     1.000000     0.000000
    π/8       0.411165    1.126997     0.923885    -0.382606
    π/4       0.884311    1.237806     0.707176    -0.706967
    3π/8      1.318095    0.873002     0.382859    -0.923726
    π/2       1.499586    0.000195     0.000294    -0.999900

With a Robin condition at $x = \pi/2$, we use equation (3) to compute

\[ c = \frac{-1 - 1.499586 - 0.000195}{0.000294 - 0.999900} = 2.500766. \]

The function $w(x) = u_1(x) + 2.500766\,u_2(x)$ is then guaranteed to satisfy both
boundary conditions. The value of $w$ at each $x_i$, $w_i$, is given in the second column
of the following table and is compared to the corresponding value of the exact
solution, $u(x) = (11/8) \sin x + (5/2) \cos x - (1/8) \sin(3x)$.

              Approximate      Exact
    x_i       Solution, w_i    Solution, u_i    Absolute Error
    0         2.500766         2.500000         0.000766
    π/8       2.721585         2.720404         0.001181
    π/4       2.652793         2.651650         0.001143
    3π/8      2.275536         2.274878         0.000658
    π/2       1.500321         1.500000         0.000321


Once again, numerical verification of the order of convergence is left as an exercise. 
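One way such a verification might be organized is sketched below for the problem of this example. It is an illustration under stated assumptions rather than a prescribed procedure: the helper names `rk4`, `exact`, and `max_error` are hypothetical, the error is measured as the maximum absolute error over the grid points, and the observed order is estimated from $\log_2(e_N / e_{2N})$, which should approach 4 for the RK4 method.

```python
import math

def rk4(f, x0, y0, h, n):
    """Classical fourth-order Runge-Kutta for y' = f(x, y); returns the n+1 states."""
    states, y = [list(y0)], list(y0)
    for i in range(n):
        x = x0 + i * h
        k1 = f(x, y)
        k2 = f(x + h / 2, [yj + h / 2 * kj for yj, kj in zip(y, k1)])
        k3 = f(x + h / 2, [yj + h / 2 * kj for yj, kj in zip(y, k2)])
        k4 = f(x + h, [yj + h * kj for yj, kj in zip(y, k3)])
        y = [yj + h / 6 * (s1 + 2 * s2 + 2 * s3 + s4)
             for yj, s1, s2, s3, s4 in zip(y, k1, k2, k3, k4)]
        states.append(y)
    return states

def exact(x):
    return 11 / 8 * math.sin(x) + 5 / 2 * math.cos(x) - math.sin(3 * x) / 8

def max_error(n):
    """Shooting solution of u'' + u = sin(3x), u'(0) = 1, u(pi/2) + u'(pi/2) = -1."""
    a, b = 0.0, math.pi / 2
    h = (b - a) / n
    # Neumann condition at x = a: IVP1 starts from (0, 1), IVP2 from (1, 0).
    y1 = rk4(lambda x, y: [y[1], -y[0] + math.sin(3 * x)], a, [0.0, 1.0], h, n)
    y2 = rk4(lambda x, y: [y[1], -y[0]], a, [1.0, 0.0], h, n)
    # Robin condition at x = b: shooting parameter from equation (3).
    beta1, beta2, beta3 = 1.0, 1.0, -1.0
    c = ((beta3 - beta1 * y1[-1][0] - beta2 * y1[-1][1])
         / (beta1 * y2[-1][0] + beta2 * y2[-1][1]))
    return max(abs(u[0] + c * v[0] - exact(a + i * h))
               for i, (u, v) in enumerate(zip(y1, y2)))

errors = {n: max_error(n) for n in (8, 16, 32, 64)}
orders = [math.log2(errors[n] / errors[2 * n]) for n in (8, 16, 32)]
for n, order in zip((8, 16, 32), orders):
    print(f"N = {n:2d} -> {2 * n:2d}: error {errors[n]:.3e}, observed order {order:.2f}")
```

Doubling $N$ repeatedly and watching the ratio of successive errors is the usual design choice here, since it needs nothing beyond the solver itself and the known exact solution.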


EXAMPLE 8.8 A Problem with a Robin Boundary Condition at x = a

Consider the ordinary differential equation

\[ u'' + u = \sin(3x), \qquad 0 < x < \pi/2 \]

subject to a Robin boundary condition at $x = 0$:

\[ u(0) + u'(0) = -1 \]

and a Neumann boundary condition at $x = \pi/2$:

\[ u'(\pi/2) = 1. \]

With the Robin boundary condition at $x = 0$, we replace the boundary value
problem with the three initial value problems

\[ \text{IVP1:} \quad u'' = -u + \sin(3x), \qquad u(0) = 0, \quad u'(0) = 0, \]

\[ \text{IVP2:} \quad u'' = -u, \qquad u(0) = 1, \quad u'(0) = 0, \]

and

\[ \text{IVP3:} \quad u'' = -u, \qquad u(0) = 0, \quad u'(0) = 1. \]

Let $u_1$, $u_2$, and $u_3$ denote the approximate solution of IVP1, IVP2, and IVP3,
respectively. Converting each initial value problem into a system and then using
the RK4 method, with $N = 4$ steps to march from $x = 0$ to $x = \pi/2$, the results
given below are obtained.

    x_i       u_1(x_i)    u_2(x_i)    u_3(x_i)
    0         0.000000    1.000000    0.000000
    π/8       0.028559    0.923885    0.382606
    π/4       0.177343    0.707176    0.706967
    3π/8      0.394369    0.382859    0.923726
    π/2       0.499685    0.000294    0.999900

We want to combine these three solutions into a single function: $w = u_1 +
c_1 u_2 + c_2 u_3$. To determine the appropriate values for $c_1$ and $c_2$, we must solve the
system of algebraic equations composed of equation (6) with $\alpha_1 = \alpha_2 = 1$ and
$\alpha_3 = -1$ and equation (8) with $\beta = 1$; that is,

\[ c_1 + c_2 = -1 \]
\[ u_2'(\pi/2)\, c_1 + u_3'(\pi/2)\, c_2 = 1 - u_1'(\pi/2). \]

The derivative values at $x = \pi/2$ are

\[ u_1'(\pi/2) = -0.00009934793668 \]
\[ u_2'(\pi/2) = -0.99990005047118 \]
\[ u_3'(\pi/2) = -0.00029430281825. \]

With these values, the constants are found to be $c_1 = -1.000199$ and $c_2 =
0.000199259$. Therefore, $w(x) = u_1(x) - 1.000199\,u_2(x) + 0.000199259\,u_3(x)$. The
value of $w$ at each $x_i$, $w_i$, is given in the second column of the following
table and is compared to the corresponding value of the exact solution, $u(x) =
(3/8) \sin x - \cos x - (1/8) \sin(3x)$.

              Approximate      Exact
    x_i       Solution, w_i    Solution, u_i    Absolute Error
    0         -1.000199        -1.000000        0.000199
    π/8       -0.895434        -0.895858        0.000424
    π/4       -0.529833        -0.530330        0.000497
    3π/8       0.0116179        0.0116068       0.000011
    π/2        0.499590         0.500000        0.000410


EXAMPLE 8.9 A Problem with an Artificial Singularity 


The boundary value problem

\[ y'' + \frac{2}{x} y' = -1 \]
\[ y'(0) = 0, \quad y(1) = 1 \]

has an artificial singularity at $x = 0$. We handle the artificial singularity here in
exactly the same manner as we did with the finite difference method (see Section 8.2);
that is, we determine the equation that applies at $x = 0$ and then define the
coefficient functions $p$, $q$ and $r$ in a piecewise manner. Taking the limit as $x \to 0$, the
differential equation becomes

\[ y''(0) = -1/3. \]

The appropriate coefficient functions for the numerical solution of the boundary
value problem are therefore

\[ p(x) = \begin{cases} 0, & x = 0 \\ -2/x, & x \ne 0 \end{cases}, \qquad q(x) = 0, \qquad \text{and} \qquad r(x) = \begin{cases} -1/3, & x = 0 \\ -1, & x \ne 0. \end{cases} \]

With these functions, the two initial value problems which must be solved are

\[ \text{IVP1:} \quad y'' = p(x)y' + r(x), \qquad y(0) = 0, \quad y'(0) = 0 \]

and

\[ \text{IVP2:} \quad y'' = p(x)y', \qquad y(0) = 1, \quad y'(0) = 0. \]

Converting these problems into systems and using the RK4 method with $N = 4$,
we obtain the approximate solutions

    x_i       y_1(x_i)     y_2(x_i)
    0.00       0.000000    1.000000
    0.25      -0.010417    1.000000
    0.50      -0.041667    1.000000
    0.75      -0.093750    1.000000
    1.00      -0.166667    1.000000

With a Dirichlet boundary condition at the right endpoint, equation (2) is used to
compute

\[ c = \frac{1 - (-0.166667)}{1.000000} = 1.166667. \]

The value of $w(x) = y_1(x) + 1.166667\,y_2(x)$ at each $x_i$ is given in the second column
of the following table and is compared to the corresponding value of the exact
solution, $y(x) = (7 - x^2)/6$.

              Approximate      Exact
    x_i       Solution, w_i    Solution, y_i    Absolute Error
    0         1.166667         1.166667         0
    0.25      1.156250         1.156250         0
    0.50      1.125000         1.125000         0
    0.75      1.072917         1.072917         0
    1         1.000000         1.000000         0


Note that, even with N = 4, we have obtained the exact solution. This happens 
because the error term associated with the RK4 method contains the fifth derivative 
of the solution. In this case, the exact solution is a second degree polynomial, for 
which the fifth derivative is identically zero. The RK4 method will therefore produce 
the exact solution with any value of N. 
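The piecewise definition of the coefficient functions translates directly into code. The following sketch, offered as an illustration only, repeats the calculation of this example with $N = 4$; the helper name `rk4` and the use of an exact floating-point test `x == 0.0` to select the limiting form are choices made for this sketch (the test is safe here because only the very first RK4 stage is evaluated at exactly $x = 0$).

```python
def rk4(f, x0, y0, h, n):
    """Classical fourth-order Runge-Kutta for y' = f(x, y); returns the n+1 states."""
    states, y = [list(y0)], list(y0)
    for i in range(n):
        x = x0 + i * h
        k1 = f(x, y)
        k2 = f(x + h / 2, [yj + h / 2 * kj for yj, kj in zip(y, k1)])
        k3 = f(x + h / 2, [yj + h / 2 * kj for yj, kj in zip(y, k2)])
        k4 = f(x + h, [yj + h * kj for yj, kj in zip(y, k3)])
        y = [yj + h / 6 * (s1 + 2 * s2 + 2 * s3 + s4)
             for yj, s1, s2, s3, s4 in zip(y, k1, k2, k3, k4)]
        states.append(y)
    return states

# Piecewise coefficients: the x = 0 branch is the limiting form of the equation.
def p(x):
    return 0.0 if x == 0.0 else -2.0 / x

def r(x):
    return -1.0 / 3.0 if x == 0.0 else -1.0

n = 4
h = 1.0 / n
# Neumann condition y'(0) = 0 at the left endpoint:
#   IVP1 starts from (y, y') = (0, 0); IVP2 (homogeneous) starts from (1, 0).
y1 = rk4(lambda x, y: [y[1], p(x) * y[1] + r(x)], 0.0, [0.0, 0.0], h, n)
y2 = rk4(lambda x, y: [y[1], p(x) * y[1]], 0.0, [1.0, 0.0], h, n)

c = (1.0 - y1[-1][0]) / y2[-1][0]          # equation (2) with beta = 1
xs = [i * h for i in range(n + 1)]
w = [u[0] + c * v[0] for u, v in zip(y1, y2)]
errors = [abs(wi - (7.0 - xi * xi) / 6.0) for xi, wi in zip(xs, w)]
print(c, max(errors))
```

Without the `x == 0.0` branches, the first stage of the first step would divide by zero, which is exactly the difficulty the limiting equation $y''(0) = -1/3$ is meant to remove.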


Throughout our development of the shooting method, we have assumed that
we would march “forward” from $x = a$ to $x = b$. There is no reason, however,
why we couldn’t march “backward” from $x = b$ to $x = a$. This approach can be
particularly useful when a Robin condition is specified at $x = a$, but not at $x = b$.
In this case, starting from $x = b$ reduces the number of initial value problems that
must be solved from three to two.

There are two other circumstances under which reversing direction in the
shooting method would be an appropriate course of action. First, the shooting
method, working in either direction, can be prone to roundoff error. If the solution
obtained marching in one direction is found to be overly contaminated by roundoff
error, marching in the opposite direction may improve performance. Second,
suppose the solutions to the initial value problems grow exponentially in one direction.
The solution for the shooting parameter(s), $c$ or $c_1$ and $c_2$, may then be extremely
ill conditioned. Marching in the other direction should alleviate this problem.


Application Problem: Cooling Fin on the Cylinder Barrel of a Motorcycle 


This problem is adapted from Incropera and DeWitt [1]. The cylinder barrel of
a motorcycle is made from an aluminum alloy with a thermal conductivity of
$k = 186$ W/m·K. The outside diameter of the barrel is 50 mm. Under normal
operating conditions, the outer surface of the barrel has a temperature of 220°C
and experiences heat loss to air at $T_\infty = 20$°C with a convection coefficient of
$h = 50$ W/m²·K.

To improve heat transfer from the barrel to the surroundings, annular fins of
length 20 mm are added, as shown in Figure 8.11. Note from the cross-sectional
view that the fins are not of uniform thickness; rather, they taper linearly from a
thickness of 6 mm at the point of contact with the barrel to a thickness of 4 mm at
the outer edge. Thus, if $r$ denotes distance from the center of the barrel, measured
in meters, and $t(r)$ denotes the thickness of the fin, also measured in meters, then

\[ t(r) = 0.0085 - \frac{r}{10}. \]
We would like to assess the improvement in heat transfer rate due to the
presence of the cooling fins. To do this, we must first determine the temperature
distribution, $T(r)$, within the fin. Let’s assume that temperature variations
across the thickness of the fin are negligible. Following the same procedure which
led to equation (6) of Section 8.2, but omitting the internal heat generation term
and taking into account the nonuniform thickness of the fin, yields the differential
equation

\[ \frac{d^2T}{dr^2} + \left( \frac{t'(r)}{t(r)} + \frac{1}{r} \right) \frac{dT}{dr} - \frac{2h}{k\,t(r)} (T - T_\infty) = 0. \]

The boundary conditions associated with this problem are

\[ T(0.025) = 220 \quad \text{and} \quad hT(0.045) + kT'(0.045) = hT_\infty. \]


Applying the shooting method with the RK4 method as the underlying initial
value problem solver and taking $N = 100$ steps to march from $r = 0.025$ meters
to $r = 0.045$ meters, we obtain the temperature profile shown in Figure 8.12. The
total fin heat transfer rate is given by

\[ q_f = -k \left( 2\pi r_o\, t(0.025) \right) T'(0.025), \]

where $r_o = 0.025$ meters is the radius of the outer surface of the barrel. From the
shooting method solution, we have the estimate

\[ T'(0.025) \approx -554.22 \ \text{K/m}. \]

Accordingly,

\[ q_f = -372\pi (0.025)(0.006)(-554.22) = 97.16 \ \text{W}. \]

Without the cooling fin, the heat loss from the same 0.006 meter strip along the
surface of the barrel would have been

\[ q = 2\pi h r_o\, t(0.025)(220 - 20) = 9.42 \ \text{W}. \]

The cooling fin therefore increases the heat loss more than tenfold.

Figure 8.11 Top view and cross-sectional view of annular cooling fins
attached to the cylinder barrel of a motorcycle.

Figure 8.12 Temperature distribution in an annular fin of nonuniform
thickness attached to the cylinder barrel of a motorcycle.
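The fin calculation can be sketched as follows. This is a rough illustration that assumes the governing equation has the form $T'' + (t'/t + 1/r)\,T' - \frac{2h}{k\,t}(T - T_\infty) = 0$ with $t(r) = 0.0085 - r/10$; the variable names are hypothetical. Because IVP1 starts with zero slope and IVP2 with unit slope, the shooting parameter $c$ from equation (3) is itself the estimate of $T'(0.025)$, which can be compared with the value $-554.22$ used in the heat-rate calculation above.

```python
import math

def rk4(f, x0, y0, h, n):
    """Classical fourth-order Runge-Kutta for y' = f(x, y); returns the n+1 states."""
    states, y = [list(y0)], list(y0)
    for i in range(n):
        x = x0 + i * h
        k1 = f(x, y)
        k2 = f(x + h / 2, [yj + h / 2 * kj for yj, kj in zip(y, k1)])
        k3 = f(x + h / 2, [yj + h / 2 * kj for yj, kj in zip(y, k2)])
        k4 = f(x + h, [yj + h * kj for yj, kj in zip(y, k3)])
        y = [yj + h / 6 * (s1 + 2 * s2 + 2 * s3 + s4)
             for yj, s1, s2, s3, s4 in zip(y, k1, k2, k3, k4)]
        states.append(y)
    return states

k, h_conv, T_inf = 186.0, 50.0, 20.0          # W/m-K, W/m^2-K, deg C
r_in, r_out, T_base = 0.025, 0.045, 220.0     # m, m, deg C

def t(r):
    return 0.0085 - r / 10.0                  # fin thickness in meters

dt_dr = -0.1                                  # derivative of t(r)

# Written as T'' = p(r) T' + q(r) T + s(r).
p = lambda r: -(dt_dr / t(r) + 1.0 / r)
q = lambda r: 2.0 * h_conv / (k * t(r))
s = lambda r: -2.0 * h_conv * T_inf / (k * t(r))

n = 100
step = (r_out - r_in) / n
y1 = rk4(lambda r, y: [y[1], p(r) * y[1] + q(r) * y[0] + s(r)], r_in, [T_base, 0.0], step, n)
y2 = rk4(lambda r, y: [y[1], p(r) * y[1] + q(r) * y[0]], r_in, [0.0, 1.0], step, n)

# Robin condition at r_out: h T + k T' = h T_inf, so equation (3) applies.
b1, b2, b3 = h_conv, k, h_conv * T_inf
c = (b3 - b1 * y1[-1][0] - b2 * y1[-1][1]) / (b1 * y2[-1][0] + b2 * y2[-1][1])

T = [u[0] + c * v[0] for u, v in zip(y1, y2)]
dT = [u[1] + c * v[1] for u, v in zip(y1, y2)]

q_fin = -k * 2.0 * math.pi * r_in * t(r_in) * dT[0]
q_bare = 2.0 * math.pi * h_conv * r_in * t(r_in) * (T_base - T_inf)
print(f"T'(r_in) = {dT[0]:.2f} K/m, q_fin = {q_fin:.2f} W, q_bare = {q_bare:.2f} W")
```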


References 
1. F. P. Incropera and D. P. DeWitt, Fundamentals of Heat and Mass Transfer, 
John Wiley and Sons, New York, 1985. 

EXERCISES 


In Exercises 1-10, suppose we use the shooting method to approximate the solution of 
the indicated boundary value problem. Write out the initial value problems that must 
be solved and the equation(s) to determine the shooting parameter(s). 


Lu! = (e+ 1)u' +2ut+ (1—2*%)e*, u(0)=—-1, u(t) =0 
2. cy” —(x@+5)y’+4y=2, y(l}=—-1, y(2)=1 

2 
3. y= ay - 2, yo) =y/(1) =0 


2+4 (2+ 2)%’ 
4. : (zy')’=1, y/(0)=0, y(1)=10 
5. y!’ = 120y — 2860, 10y(0) + 35y/(0) = 200, (0.15) = 20 
: = (2°') =1l+2", y'(0)=0, 4y(1)+y/(1) =40 
. cy! — (20 +1)y + (2+ Dy =0, y(t) — 2y/(1) = Be, (3) + y'(3) = 26e" 
: : (zy’)’+y=10, y(1)= 20, y(3) +4/(3) = 10 


ao N 


1 
9. y" toy +y=1, y(0)=0, o(1)=1 
10. (1-xcotz)y"—ay'ty =0, y(1)+y’(1) =1+sin(1)+cos(1), y‘(2) = 1+c0s(2) 
In Exercises 11-17, numerically verify that the shooting method, with the RK4 method
as the underlying initial value problem solver, produces results that converge toward
the exact solution with rate of convergence $O(h^4)$.

11. $-u'' + \pi^2 u = 2\pi^2 \sin(\pi x)$, $u(0) = u(1) = 0$, $u(x) = \sin(\pi x)$

12. $u'' + u = \sin(3x)$, $u'(0) = 1$, $u(\pi/2) + u'(\pi/2) = -1$, $u(x) = (11/8) \sin x +
(5/2) \cos x - (1/8) \sin(3x)$

13. $u'' + u = \sin(3x)$, $u(0) + u'(0) = -1$, $u'(\pi/2) = 1$, $u(x) = (3/8) \sin x - \cos x -
(1/8) \sin(3x)$

”" KS 8 } 7 
ut —u= (3) , w(0)=u(l) =0, ue) = 2in (<5) 
yl! — Qwy + Wy = 3827 4+2%nz, y(1) = 9/4, y(2) = 13In2, y(z) = 


3 3 
5 (1+ 52-2") + (14 327)Inz 


y+ay't+y=27, y(0)=0, y(1)=1 


1 
y+ ov +y=1, y(0)=0, y¥W=1 


One component of a model for a styrene monomer tubular reactor is the steady- 
state temperature profile of the solid phase catalyst. The governing boundary 
value problem is 


dr. Lar 
eae 

dr 

= =0, 7(1)=1 
de | .—9 


Here, rT = (Ic -T)/(Tw —T) is the nondimensional catalyst temperature, z is the 
nondimensional radial position, T is the constant temperature of the fluid in the 
reactor, and Ty is the temperature at the wall of the reactor. The parameter 8 
is given by 

ga R°nA 

k(1 — €)’ 

where R = 1.3 cm is the radius of the reactor, h = 107° cal/em?+s-°C is 
the heat transfer coefficient, « = 0.36 is the porosity of the packed bed reactor, 
A = 15 cm! is the surface area of the catalyst per unit volume and k = 
0.0034 cal/em-s-° C is the thermal diffusivity of the catalyst. Approximate 7(z) 
using Ay = 0.0025. 


A thin cylindrical fiber, ten inches in length, has its left end maintained at a 
constant temperature 7) and experiences convective heat loss along its lateral 
surface and from its right end. The temperature within the fiber is governed by 
the differential equation 


eG (1 ¢) +2hrT = 2hrTx., 


= A[T(5) — Too]. 
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20. 


21. 


22. 


23. 


24. 
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The parameters in this problem are the thermal conductivity of the fiber & = 
2BTU/sec -in- °F, the convective heat transfer coefficient h = 10°->5BTU/sec : 
in?° F, the radius of the fiber 


0.1 
= 0.002 es 
r = 0.00 (1+ eo) inches, 


the ambient temperature of the surroundings Too = 50°F and the constant 
temperature maintained at the left end of the fiber T) = 200° F. Determine the 
temperature within the fiber at increments of 0.1 inches. 


A wooden beam of square cross section is supported at both ends and is carrying
a distributed lateral load of uniform intensity w = 20 lb/ft and an axial tension
load T = 100 lb. The deflection, u(x), of the beam’s centerline satisfies the
boundary value problem


” T __ w 
ame Tae 
u(0) = u(L) = 0, 


where L = 6 ft is the length, B = 1.3 x 108 Ib /in? is the modulus of elasticity 
and I = s* is the moment of inertia of the beam. The side length of the square 
cross section is s = 4 inches. 
(a) Determine the deflection of the beam at 1 inch intervals along its length. 
(b) Repeat part (a) assuming that the beam tapers along its length so that 

s = (4-—2/2L) inches. 
Repeat Exercise 20 for a metal rod of circular cross section. Use the parameter 
values 

w = 200 lb/ft, T = 750 lb 
L=10ft, B=3.0x 10" lb/in*, and I= mr4/4. 


For part (a), take r = 3 inches. For part (b), use r = (3+0.25sin(sa/ZL)) inches. 


In the “Flow between Parallel Plates” problem of Section 8.1, suppose we intro- 
duce the effect of a constant pressure gradient, which we denote by dp/dz, in 
the direction of the flow. The boundary value problem for determining the flow 
velocity then becomes 


au! = go | 842888596 839.456] | h? dp 
7 (293.16 + 80y)3 (293.16 + 80y)? Uo dx 
u(0)=0, u(l)=1 
~ 4B 


The constant tet is called the pressure parameter. Calculate the velocity 
distribution for valued of the pressure parameter of —4, —2, 0, 2, and 4. 


Rework the first application problem from Section 8.2, “Steady-State Tempera- 
ture Distribution in a Pin Fin,” using the shooting method. ; 


Rework the second application problem from Section 8.2, “The Heat Pack,” using
the shooting method.

25. Rework the application problem from this section, “Cooling Fin on the Cylinder
Barrel of a Motorcycle,” changing the temperature of the outer surface of the
barrel to 250°C and the convection coefficient to $h = 100$ W/m²·K. Use the
values given in the text for all other parameters.

26. Rework the application problem from this section, “Cooling Fin on the Cylinder 
Barrel of a Motorcycle,” but assume the fin has a uniform thickness of 6 mm. 
Use the values given in the text for all other parameters. 


8.5 THE SHOOTING METHOD, PART II: NONLINEAR BOUNDARY VALUE 
PROBLEMS 


We now turn our attention to the development of the shooting method for the
general nonlinear boundary value problem

\[ y'' = f(x, y, y'), \qquad \alpha_1 y(a) + \alpha_2 y'(a) = \alpha_3, \qquad \beta_1 y(b) + \beta_2 y'(b) = \beta_3. \]


As indicated in the previous section, the fundamental strategy behind the shoot- 
ing method is the transformation of a boundary value problem into a rootfinding 
problem that is, itself, based on the solution of an initial value problem. 


A Specific Boundary Value Problem 


Let’s develop the details of this approach within the context of a specific example. 
Suppose we wish to approximate the solution of the nonlinear differential equation

\[ y y'' + (y')^2 + 1 = 0 \]

subject to Dirichlet boundary conditions

\[ y(1) = 1, \quad y(2) = 2. \]

Since the value of the solution is known at $x = 1$, only a value for the first derivative
needs to be supplied to complete the specification of an initial value problem. Let
$y(x; p)$ denote the solution of the initial value problem that results upon setting
$y'(1) = p$. To obtain an approximation to the solution of the original boundary
value problem, we must determine the value $p = p^*$ for which $y(2; p^*) = 2$. If we
define the objective function $F(p) = 2 - y(2; p)$, then solving the boundary value
problem becomes equivalent to locating the root of $F$. This is the aforementioned
rootfinding problem.

To approximate p* we can use any of the rootfinding techniques that were 
developed in Chapter 2. In the present framework, each “evaluation” of the func- 
tion F requires the solution of an initial value problem. Therefore, we will avoid 
the use of the bisection method: The linear convergence of the method is much too 
slow. We will also bypass Newton’s method. Although Newton’s method would 
provide quadratic convergence, it requires the calculation of 


\[ F'(p) = -\frac{\partial y}{\partial p}(2;p). \]

This quantity can be obtained by solving a second initial value problem (see 
Asaithambi [1] or Burden and Faires [2] for details), but that doubles the computational 
cost of each iteration. Hence, we will implement the secant method. This may 
require a few more iterations than Newton’s method but generally will result in 
the numerical solution of fewer initial value problems. As a reminder, recall that 
starting from the initial iterates $p_0$ and $p_1$, the secant method generates a sequence 
of approximations according to the rule 


\[ p_n = p_{n-1} - F(p_{n-1}) \, \frac{p_{n-1} - p_{n-2}}{F(p_{n-1}) - F(p_{n-2})}. \]


Returning to the problem at hand, we will use the classical fourth-order 
Runge-Kutta (RK4) method to solve all initial value problems, taking ten steps 
to integrate from x = 1 to x = 2. With $p_0 = 0$ [i.e., setting y'(1) = 0], we find 
$y(2;0) \approx 0.104101$, so that F(0) = 2 - y(2;0) = 1.895899. If we next choose $p_1 = 1$, 
the RK4 method produces the value $y(2;1) \approx 1.414197$, from which we compute 
F(1) = 0.585803. Applying the secant method then yields 

\[ p_2 = 1 - 0.585803 \cdot \frac{1 - 0}{0.585803 - 1.895899} = 1.447145. \]

Assigning this value to y'(1) leads to $y(2;p_2) \approx 1.701210$ and $F(p_2) = 0.298790$. 
Thus, 

\[ p_3 = 1.447145 - 0.298790 \cdot \frac{1.447145 - 1}{0.298790 - 0.585803} = 1.912638. \]

Continuing in this fashion, the next four iterations produce the results 

\[
\begin{array}{lll}
p_3 = 1.912638, & y(2;p_3) \approx 1.955692, & F(p_3) = 0.044308 \\
p_4 = 1.993685, & y(2;p_4) \approx 1.996677, & F(p_4) = 0.003322 \\
p_5 = 2.000256, & y(2;p_5) \approx 1.999963, & F(p_5) = 3.698 \times 10^{-5} \\
p_6 = 2.000330, & y(2;p_6) \approx 1.999999969, & F(p_6) = 3.088 \times 10^{-8}.
\end{array}
\]

With $|F(p_6)| < 5 \times 10^{-7}$ we will terminate the iteration and accept $y(x;p_6)$ as the 
approximate solution to the boundary value problem 

\[ y y'' + (y')^2 + 1 = 0 \]
\[ y(1) = 1, \qquad y(2) = 2. \]

The convergence tolerance has been applied to $|F(p_n)|$, and not to the difference 
between successive p values, since the true objective is to match the second boundary 
condition, not to approximate the necessary initial condition for the first derivative. 
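
The procedure just described can be sketched in a few lines of Python. This is a minimal illustration rather than the text's own program: the function names (`rk4_system`, `rhs`, `F`, `shoot`) and the iteration cap `max_iter` are assumptions introduced here, while the step count, starting values, and tolerance are the ones quoted above.

```python
def rk4_system(f, x0, u0, x_end, n_steps):
    """Integrate u' = f(x, u) from x0 to x_end with classical RK4.

    Returns a list of (x, u) pairs, one per grid point.
    """
    h = (x_end - x0) / n_steps
    x, u = x0, list(u0)
    path = [(x, tuple(u))]
    for i in range(n_steps):
        k1 = f(x, u)
        k2 = f(x + h / 2, [ui + h / 2 * ki for ui, ki in zip(u, k1)])
        k3 = f(x + h / 2, [ui + h / 2 * ki for ui, ki in zip(u, k2)])
        k4 = f(x + h, [ui + h * ki for ui, ki in zip(u, k3)])
        u = [ui + h / 6 * (a + 2 * b + 2 * c + d)
             for ui, a, b, c, d in zip(u, k1, k2, k3, k4)]
        x = x0 + (i + 1) * h
        path.append((x, tuple(u)))
    return path


def rhs(x, u):
    # y y'' + (y')^2 + 1 = 0 rewritten as a first-order system in u = (y, y').
    y, dy = u
    return [dy, -(1.0 + dy * dy) / y]


def F(p, n_steps=10):
    """Objective F(p) = 2 - y(2; p) for the initial values y(1) = 1, y'(1) = p."""
    path = rk4_system(rhs, 1.0, [1.0, p], 2.0, n_steps)
    return 2.0 - path[-1][1][0]


def shoot(p_prev, p_curr, tol=5e-7, max_iter=50, n_steps=10):
    """Secant iteration on F; stops once |F(p)| < tol (max_iter is a safety cap)."""
    f_prev, f_curr = F(p_prev, n_steps), F(p_curr, n_steps)
    for _ in range(max_iter):
        if abs(f_curr) < tol or f_curr == f_prev:
            break
        p_next = p_curr - f_curr * (p_curr - p_prev) / (f_curr - f_prev)
        p_prev, f_prev = p_curr, f_curr
        p_curr, f_curr = p_next, F(p_next, n_steps)
    return p_curr, f_curr


p_star, residual = shoot(0.0, 1.0)
print(p_star, residual)
```

Note that the stopping test is placed on the residual |F(p)|, mirroring the choice made in the text, and that each call to `F` costs one full RK4 integration.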

The values of the approximate solution, $y(x_i;p_6)$, are tabulated below. For 
comparison, the values of the exact solution, 

\[ y(x) = \sqrt{6x - 4 - x^2}, \]

are also displayed. The final column of the table lists the absolute error in the 
approximate solution. 

  x_i     y(x_i;p_6)    y(x_i)      Absolute Error

  1.00    1.000000      1.000000 
  1.10    1.178956      1.178983    0.000027 
  1.20    1.326623      1.326650    0.000027 
  1.30    1.452560      1.452584    0.000024 
  1.40    1.562030      1.562050    0.000020 
  1.50    1.658296      1.658312    0.000016 
  1.60    1.743547      1.743560    0.000013 
  1.70    1.819332      1.819340    0.000008 
  1.80    1.886790      1.886796    0.000006 
  1.90    1.946789      1.946792    0.000003 
  2.00    2.000000      2.000000 


Finally, we investigate the order of convergence. Since the RK4 method has 
been used to compute the solution of all of the initial value problems, we expect that 
the approximate solution of the boundary value problem should converge toward 
the exact solution with rate of convergence $O(h^4)$. The following table lists the 
maximum absolute error and the root mean square (rms) error in the approximate 
solution as a function of N, the number of steps taken to integrate from x = 1 to 
x = 2. The convergence tolerance for the secant method was set at $5 \times 10^{-14}$ for 
all cases. Note that with each doubling of N, the error drops by roughly a factor 
of $16 = 2^4$, exactly what one would expect from a fourth-order method. 

           Maximum 
     N     Absolute Error       Error Ratio    rms Error            Error Ratio 
     5     3.048872 x 10^-4                    1.851011 x 10^-4 
    10     2.726369 x 10^-5     11.182906      1.710854 x 10^-5     10.819221 
    20     1.812206 x 10^-6     15.044472      1.116743 x 10^-6     15.320038 
    40     1.116332 x 10^-7     16.233579      6.885077 x 10^-8     16.219755 
    80     6.870329 x 10^-9     16.248596      4.239422 x 10^-9     16.240602 
   160     4.252689 x 10^-10    16.155256      2.625002 x 10^-10    16.150171 
   320     2.643930 x 10^-11    16.084730      1.632357 x 10^-11    16.081050 
   640     1.648681 x 10^-12    16.036633      1.017769 x 10^-12    16.038575 
  1280     1.023626 x 10^-13    16.106291      6.221091 x 10^-14    16.359982 
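
The error ratios in the table can be turned into an observed order of convergence by taking a base-2 logarithm. The short sketch below does this for the maximum absolute error column; the error values are copied from the table, and the helper name `observed_orders` is an illustrative assumption.

```python
import math

# Maximum absolute errors copied from the table (N -> error).
max_errors = {
    5: 3.048872e-4,
    10: 2.726369e-5,
    20: 1.812206e-6,
    40: 1.116332e-7,
    80: 6.870329e-9,
    160: 4.252689e-10,
    320: 2.643930e-11,
    640: 1.648681e-12,
    1280: 1.023626e-13,
}


def observed_orders(errors):
    """Return (N, ratio, order) for each doubling of N.

    ratio = E(N/2) / E(N) and order = log2(ratio); a fourth-order
    method should give ratios near 16 and orders near 4.
    """
    ns = sorted(errors)
    rows = []
    for coarse, fine in zip(ns, ns[1:]):
        ratio = errors[coarse] / errors[fine]
        rows.append((fine, ratio, math.log2(ratio)))
    return rows


for n, ratio, order in observed_orders(max_errors):
    print(f"N = {n:5d}   ratio = {ratio:9.6f}   observed order = {order:.3f}")
```

The coarsest grids give ratios below 16 because the asymptotic regime has not yet been reached; the later rows settle close to order 4.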


Other Boundary Conditions 


The shooting method can also be used to approximate the solution of a boundary 
value problem whose boundary conditions are not of Dirichlet type. We simply 
have to adjust the initial conditions and/or the objective function to reflect the 
specific set of boundary conditions given in the problem statement. For instance, 
if the Neumann boundary condition $y'(a) = \alpha$ is specified at x = a, then the initial 
value of the solution becomes the variable in the rootfinding problem, and the initial 
conditions applied to the differential equation must take the form 

\[ y(a) = p, \qquad y'(a) = \alpha. \]

The case of a Robin boundary condition at x = a will be discussed momentarily. As 
for the objective function, this is derived from the boundary condition at the right 
endpoint. In the case of the Neumann condition $y'(b) = \beta$, the objective function 
would be $F(p) = \beta - y'(b;p)$, whereas in the case of the Robin condition 
$\beta_1 y(b) + \beta_2 y'(b) = \beta_3$, the objective function would be 
$F(p) = \beta_3 - \beta_1 y(b;p) - \beta_2 y'(b;p)$. 


EXAMPLE 8.10 A Problem with One Neumann and One Robin Boundary 
Condition 


The exact solution of the nonlinear boundary value problem 

\[ 2x y'' + (y')^2 - 4y = 4x \]
\[ y'(1) = 4, \qquad y(3) + 2y'(3) = 32 \]

is $y(x) = (x+1)^2$. To approximate the solution of this problem using the shooting 
method, we convert the boundary value problem into the rootfinding problem 

    determine the value of p for which $F(p) = 32 - y(3;p) - 2y'(3;p)$ is equal 
    to zero. 

Here, y(x;p) denotes the solution of the initial value problem 

\[ 2x y'' + (y')^2 - 4y = 4x \]
\[ y(1) = p, \qquad y'(1) = 4. \]

Note how the boundary condition at x = 1 (the left endpoint) dictates the initial 
conditions on the initial value problem, while the boundary condition at x = 3 (the 
right endpoint) dictates the objective function for the rootfinding problem. 

The RK4 method is used to compute y(x;p) for each $p = p_n$, with a step size 
of $\Delta x = 0.2$. The secant method is initialized with $p_0 = 0$ and $p_1 = 1$, and iterations 
are terminated when |F(p)| falls below a tolerance of $5 \times 10^{-7}$. Five iterations are 
needed to achieve convergence. The results of each iteration are listed below, where 
the values of y(3;p) and y'(3;p) have been reported to six decimal places for display 
purposes. 

  p           y(3;p)       y'(3;p)     F(p) 
  0            8.303482    5.781270    12.133978 
  1           10.361601    6.445059    8.748282 
  3.583894    15.254059    7.811455    1.123031 
  3.964445    15.936660    7.984070    0.0952008 
  3.999693    15.999486    7.999773    9.677 x 10^-4 
  4.000054    16.000131    7.999934    8.265 x 10^-7 
  4.000055    16.000132    7.999934    7.168 x 10^-14 


The values of $y(x_i;p_6)$ are shown in the following table and compared with the 
values of the exact solution. The absolute errors are all on the order of $10^{-4}$, even 
with a step size as large as $\Delta x = 0.2$. 

  x_i     y(x_i;p_6)    y(x_i)    Absolute Error 
  1.00     4.000055      4.00     0.000055 
  1.20     4.840102      4.84     0.000102 
  1.40     5.760118      5.76     0.000118 
  1.60     6.760124      6.76     0.000124 
  1.80     7.840125      7.84     0.000125 
  2.00     9.000125      9.00     0.000125 
  2.20    10.240126     10.24     0.000126 
  2.40    11.560127     11.56     0.000127 
  2.60    12.960128     12.96     0.000128 
  2.80    14.440130     14.44     0.000130 
  3.00    16.000132     16.00     0.000132 
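
Example 8.10 shows that only two ingredients change from one set of boundary conditions to the next: the initial values built from p, and the objective function built from the solution at the right endpoint. The sketch below makes that explicit by passing both as callables. The names `rk4`, `shoot`, `initial`, and `objective`, and the cap `max_iter`, are assumptions made for illustration; the numerical parameters are those of the example.

```python
def rk4(f, x0, u0, x_end, n_steps):
    """Classical RK4 for the system u' = f(x, u); returns u at x_end."""
    h = (x_end - x0) / n_steps
    u = list(u0)
    for i in range(n_steps):
        x = x0 + i * h
        k1 = f(x, u)
        k2 = f(x + h / 2, [a + h / 2 * b for a, b in zip(u, k1)])
        k3 = f(x + h / 2, [a + h / 2 * b for a, b in zip(u, k2)])
        k4 = f(x + h, [a + h * b for a, b in zip(u, k3)])
        u = [a + h / 6 * (b + 2 * c + 2 * d + e)
             for a, b, c, d, e in zip(u, k1, k2, k3, k4)]
    return u


def shoot(f, a, b, n_steps, initial, objective, p0, p1, tol=5e-7, max_iter=50):
    """Secant iteration on F(p) = objective(y(b; p), y'(b; p)).

    initial(p) returns [y(a), y'(a)]; objective encodes the condition at x = b.
    """
    def F(p):
        yb, dyb = rk4(f, a, initial(p), b, n_steps)
        return objective(yb, dyb)

    f0, f1 = F(p0), F(p1)
    for _ in range(max_iter):
        if abs(f1) < tol or f1 == f0:
            break
        # The right-hand side is evaluated with the old values before assignment.
        p0, p1, f0 = p1, p1 - f1 * (p1 - p0) / (f1 - f0), f1
        f1 = F(p1)
    return p1, f1


def rhs(x, u):
    # 2x y'' + (y')^2 - 4y = 4x solved for y''.
    y, dy = u
    return [dy, (4.0 * x + 4.0 * y - dy * dy) / (2.0 * x)]


# Neumann condition y'(1) = 4 on the left, Robin condition y(3) + 2y'(3) = 32 on the right.
p, residual = shoot(
    rhs, 1.0, 3.0, 10,
    initial=lambda p: [p, 4.0],
    objective=lambda yb, dyb: 32.0 - yb - 2.0 * dyb,
    p0=0.0, p1=1.0,
)
print(p, residual)
```

Switching to a Dirichlet condition at x = 1 or a Neumann condition at x = 3 only requires different `initial` and `objective` callables; the integrator and the secant loop stay the same.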


The final case to consider is that of a Robin boundary condition specified 
at the left endpoint, x = a. Theoretically, we can let either the initial value of 
the solution or the initial value of the first derivative serve as the variable for the 
rootfinding problem and compute the other initial value from the Robin boundary 
condition. That is, if the boundary condition at x = a is $\alpha_1 y(a) + \alpha_2 y'(a) = \alpha_3$, 
then we can either apply the initial conditions 

\[ y(a) = p, \qquad y'(a) = (\alpha_3 - \alpha_1 p)/\alpha_2, \]

or the initial conditions 

\[ y(a) = (\alpha_3 - \alpha_2 p)/\alpha_1, \qquad y'(a) = p. \]

Unfortunately, the solution of an initial value problem with initial conditions of 
this type is significantly more sensitive to the choice of p than is the solution of an 
initial value problem for which one of the initial values is fixed. 

To illustrate this point, consider the differential equation $y'' = 2y^3$. Whether 
the boundary conditions are 

\[ y(0) = 1/3, \qquad y(1) = 1/4, \]

\[ y'(0) = -1/9, \qquad y(1) = 1/4, \]

or 

\[ 3y(0) - 9y'(0) = 2, \qquad y(1) = 1/4, \]

the exact solution to the resulting boundary value problem is $y(x) = 1/(x+3)$. If 
we were to use the shooting method to solve each of these boundary value problems, 
the first set of boundary conditions would give rise to the initial conditions 

\[ y(0) = 1/3, \qquad y'(0) = p, \tag{1} \]

the second set would lead to the initial conditions 

\[ y(0) = p, \qquad y'(0) = -1/9, \tag{2} \]

and the third set would give either 

\[ y(0) = (2/3) + 3p, \qquad y'(0) = p \tag{3} \]

or 

\[ y(0) = p, \qquad y'(0) = -(2/9) + (1/3)p. \tag{4} \]


Figure 8.13 displays the solutions of $y'' = 2y^3$ subject to initial condition (1) 
(top graph) and initial condition (3) (bottom graph) for p = -1, -1/2, 0, 1/2 
and 1. The solution is clearly much more sensitive to the choice of p with initial 
condition (3). This sensitivity can be expected to play havoc with the convergence 
of the shooting method. 

Next, the solutions of $y'' = 2y^3$ subject to initial condition (2) (top graph) 
and initial condition (4) (bottom graph) for p = -1, -1/2, 0, 1/2 and 1 are 
displayed in Figure 8.14. Though not as dramatic as the situation depicted in 
Figure 8.13, the solution of the initial value problem is still clearly more sensitive 
to the choice of p with initial condition (4). 

The moral of this story is simple. When a boundary value problem has a 
Robin boundary condition at the left endpoint of the domain, but not at the right 
endpoint, it is advisable to reverse direction and integrate from x = b back to x = a. 
This can be accomplished by introducing the change of independent variable 

\[ z = \frac{b - x}{b - a}. \]

Note that $x = b \to a$ corresponds to $z = 0 \to 1$. To replace the derivatives in 
the differential equation and in the boundary conditions, we make use of the chain 
rule to determine 

\[ \frac{d}{dx} = \frac{d}{dz}\frac{dz}{dx} = -\frac{1}{b-a}\frac{d}{dz} \]

and 

\[ \frac{d^2}{dx^2} = \frac{d}{dx}\left(\frac{d}{dx}\right) = -\frac{1}{b-a}\frac{d}{dz}\left(-\frac{1}{b-a}\frac{d}{dz}\right) = \frac{1}{(b-a)^2}\frac{d^2}{dz^2}. \]


EXAMPLE 8.11 Reversing the Direction of Integration 


Let’s approximate the solution of the nonlinear boundary value problem 

\[ y'' = 2y^3 \]
\[ 3y(0) - 9y'(0) = 2, \qquad y(1) = 1/4 \]

using the shooting method. Since this problem has a Robin condition specified at 
the left end of the domain, we first introduce the new independent variable z = 1 - x. 

Figure 8.13 (Top) Solution of $y'' = 2y^3$ subject to the initial conditions 
y(0) = 1/3 and y'(0) = -1, -1/2, 0, 1/2, 1. (Bottom) Solution of $y'' = 2y^3$ 
subject to the initial conditions y(0) = (2/3) + 3y'(0) and y'(0) = -1, -1/2, 0, 1/2, 1. 

Figure 8.14 (Top) Solution of $y'' = 2y^3$ subject to the initial conditions 
y(0) = -1, -1/2, 0, 1/2, 1 and y'(0) = -1/9. (Bottom) Solution of $y'' = 2y^3$ 
subject to the initial conditions y(0) = -1, -1/2, 0, 1/2, 1 and 
y'(0) = -(2/9) + (1/3)y(0). 

In terms of this new variable the boundary value problem becomes 

\[ y'' = 2y^3 \]
\[ y(0) = 1/4, \qquad 3y(1) + 9y'(1) = 2, \]

where primes now denote differentiation with respect to z. 


To apply the shooting method to the latter boundary value problem, we need 
to solve the rootfinding problem 

    determine the value of p for which $F(p) = 2 - 3y(1;p) - 9y'(1;p)$ is equal 
    to zero, 

where y(z;p) is the solution to the initial value problem 

\[ y'' = 2y^3 \]
\[ y(0) = 1/4, \qquad y'(0) = p. \]

Let’s solve the initial value problems using the RK4 method with a step size of 
$\Delta z = 0.1$. For the rootfinding problem, take $p_0 = 0$ and $p_1 = 1$ and use a convergence 
tolerance of $5 \times 10^{-14}$. With six iterations of the secant method, the solution 
converges to the values listed below. Since z = 1 - x, the values in the last column 
were obtained via the formula 

\[ y(x_i; p_7) = y(1 - z_i; p_7). \]

  z_i     y(z_i;p_7)  |  x_i     y(x_i;p_7) 
  0.00    0.250000    |  0.00    0.333333 
  0.10    0.256410    |  0.10    0.322581 
  0.20    0.263158    |  0.20    0.312500 
  0.30    0.270270    |  0.30    0.303030 
  0.40    0.277777    |  0.40    0.294118 
  0.50    0.285714    |  0.50    0.285714 
  0.60    0.294118    |  0.60    0.277777 
  0.70    0.303030    |  0.70    0.270270 
  0.80    0.312500    |  0.80    0.263158 
  0.90    0.322581    |  0.90    0.256410 
  1.00    0.333333    |  1.00    0.250000 

All solution values are correct to the digits shown. In fact, the largest absolute 
error occurs in the value of $y(x_0 = 0; p_7)$ and is roughly $2.13 \times 10^{-8}$. 
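
The reversal of direction used in Example 8.11 can be sketched in code as follows. The helper `reversed_rhs` applies the chain-rule substitutions for z = (b - x)/(b - a) to a generic right-hand side; it, the other function names, and the iteration cap are assumptions made for this illustration. The step size, starting values, and tolerance follow the example.

```python
def rk4_path(f, t0, u0, t_end, n_steps):
    """Classical RK4 for u' = f(t, u); returns the list of states at every grid point."""
    h = (t_end - t0) / n_steps
    u = list(u0)
    path = [list(u)]
    for i in range(n_steps):
        t = t0 + i * h
        k1 = f(t, u)
        k2 = f(t + h / 2, [a + h / 2 * b for a, b in zip(u, k1)])
        k3 = f(t + h / 2, [a + h / 2 * b for a, b in zip(u, k2)])
        k4 = f(t + h, [a + h * b for a, b in zip(u, k3)])
        u = [a + h / 6 * (b + 2 * c + 2 * d + e)
             for a, b, c, d, e in zip(u, k1, k2, k3, k4)]
        path.append(list(u))
    return path


def reversed_rhs(f, a, b):
    """Given y'' = f(x, y, y'), return g with y_zz = g(z, y, y_z), z = (b - x)/(b - a)."""
    length = b - a

    def g(z, y, yz):
        # x = b - (b - a) z, dy/dx = -y_z / (b - a), d2y/dx2 = y_zz / (b - a)^2.
        return length * length * f(b - length * z, y, -yz / length)

    return g


g = reversed_rhs(lambda x, y, dy: 2.0 * y ** 3, 0.0, 1.0)


def system(z, u):
    return [u[1], g(z, u[0], u[1])]


def F(p, n_steps=10):
    """F(p) = 2 - 3 y(1; p) - 9 y'(1; p), with y(0) = 1/4, y'(0) = p in the variable z."""
    y1, dy1 = rk4_path(system, 0.0, [0.25, p], 1.0, n_steps)[-1]
    return 2.0 - 3.0 * y1 - 9.0 * dy1


def secant(func, p0, p1, tol=5e-14, max_iter=50):
    f0, f1 = func(p0), func(p1)
    for _ in range(max_iter):
        if abs(f1) < tol or f1 == f0:
            break
        p0, p1, f0 = p1, p1 - f1 * (p1 - p0) / (f1 - f0), f1
        f1 = func(p1)
    return p1


p = secant(F, 0.0, 1.0)
z_path = rk4_path(system, 0.0, [0.25, p], 1.0, 10)
# Since x = 1 - z, listing the z-grid values in reverse order gives y on the x-grid.
y_on_x_grid = [state[0] for state in reversed(z_path)]
print(p, y_on_x_grid)
```

In the z variable the Robin condition has moved to the right endpoint, so it enters only through the objective function, and the initial values y(0) = 1/4 and y'(0) = p no longer both depend on p.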


Application Problem: Density-Dependent Dispersal of an Insect Population 


When we investigated the spread of an insect population in the Chapter 8 Overview 
and in Section 8.3, we assumed the coefficient of diffusion, D, was constant. Suppose 
we now make the more realistic assumption that D is an increasing function of the 
population density, N. In particular, let’s take 

\[ D = D_0 \left( \frac{N}{K} \right)^m, \]

where $D_0$ is a positive constant, K is the environmental carrying capacity and 
m > 0. Using this density-dependent diffusion coefficient, equation (11) from the 
Chapter 8 Overview becomes 

\[
\begin{aligned}
\frac{d}{dX}\left[ D_0 \left(\frac{N}{K}\right)^m \frac{dN}{dX} \right] + rN\left(1 - \frac{N}{K}\right) &= 0 \\
\frac{D_0 K}{m+1}\,\frac{d^2\left[(N/K)^{m+1}\right]}{dX^2} + rN\left(1 - \frac{N}{K}\right) &= 0.
\end{aligned}
\]


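Introducing the nondimensional variables $n = (N/K)^{m+1}$ and $x = X\sqrt{r/D_0}$ yields 
the differential equation 

\[ \frac{1}{m+1}\, n'' + n^{1/(m+1)}\left[1 - n^{1/(m+1)}\right] = 0 \]

or, equivalently, 

\[ n'' + (m+1)\, n^{1/(m+1)}\left[1 - n^{1/(m+1)}\right] = 0, \tag{5} \]

where primes denote differentiation with respect to x. 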
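
For readers who want the intermediate algebra, the steps leading to equation (5) can be written out as below. This is a sketch under the assumption that the steady-state balance has the form $\frac{d}{dX}\left[D\,\frac{dN}{dX}\right] + rN(1 - N/K) = 0$ with $D = D_0 (N/K)^m$; it uses $N = K n^{1/(m+1)}$ and $d^2/dX^2 = (r/D_0)\, d^2/dx^2$.

\[
\begin{aligned}
\left(\frac{N}{K}\right)^m \frac{dN}{dX} &= \frac{K}{m+1}\,\frac{d}{dX}\left(\frac{N}{K}\right)^{m+1} = \frac{K}{m+1}\,\frac{dn}{dX}, \\
\frac{D_0 K}{m+1}\,\frac{d^2 n}{dX^2} + rK\, n^{1/(m+1)}\left[1 - n^{1/(m+1)}\right] &= 0, \\
\frac{rK}{m+1}\, n'' + rK\, n^{1/(m+1)}\left[1 - n^{1/(m+1)}\right] &= 0 .
\end{aligned}
\]

Multiplying the last line by $(m+1)/(rK)$ gives the form with a unit coefficient on $n''$.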
To investigate the effect of density-dependent dispersal on the steady-state 
density profile, (5) was solved, subject to the boundary conditions 


n(0)}=0 and n’(3) =0, 


for m = 1, 2, and 3. When applying the shooting method, all initial value problems 
were handled using the RK4 method with a step size of $\Delta x = 0.01$. The secant 
method was initialized with $p_0 = 1$ and $p_1 = 2$, and iterations were terminated 
when 

|F(p)| = |n'(3;p)| <5 x 107°. 


Seven, four, and five iterations were needed to achieve convergence for the cases 
m = 1, 2, and 3, respectively. 

Results of these experiments are summarized in Figure 8.15. For each value 
of m, the quantity 

\[ [n(x)]^{1/(m+1)} = \frac{N(x)}{K} \]

is plotted. The density profile corresponding to a constant coefficient of diffusion 
has been included for comparison. To obtain the constant D profile, the secant 
method was initialized with $p_0 = 1$ and $p_1 = 0.5$, and seven iterations were needed 
to achieve convergence. Observe that increasing the value of m leads to an overall 
increase in the density profile. 

Figure 8.15 Steady-state population density profiles for an insect population 
which disperses with a density-dependent dispersal rate. 

EXERCISES 


In Exercises 1-6, suppose we use the shooting method to approximate the solution of 
the indicated boundary value problem. Write out the initial value problem that must 
be solved and the objective function for the corresponding rootfinding problem. 

Ly ty =1, ¥(0)=1, ylp=2 

2. $y'' + yy' + (y')^2 = 0$, y(0) = 0, y(1) + y'(1) = 1 

3. $y'' = -2y^2 - 4xyy'$, y'(0) = 0, y(1) = 1/2 

4. y" + dy! =-2y/(L+2*), y(0)=1 (1) =0 

5. $y'' + 2y^2 - 8x^2y^3 = 0$, y(0) = 1, y'(1) = -1/2 

6. y" =6y", ¥(O)=1/4, vty) =3 


In Exercises 7-10, make a change of independent variable to reverse the direction of 
integration, and write out the resulting boundary value problem. 


7 ty tty’)? =0, y(O)+y'(0)=0, y'(1)=l/e 




8. $y'' = -2y^2 - 4xyy'$, y(0) + y'(0) = 1, y(1) = 1/2 
9. y+ 7 Se y(0)+y/(0)=1, y(1)=0 
10. $y'' + 2y^2 - 8x^2y^3 = 0$, y(0) + y'(0) = 1, y(1) = 1/2 


In Exercises 11-16, numerically verify that the shooting method, with the RK4 method 
as the underlying initial value problem solver, produces results that converge toward 
the exact solution with rate of convergence O(h*). 


11. $y'' = -2y^2 + 8x^2y^3$, y(0) = 1, y(1) = 1/2, $y(x) = 1/(1 + x^2)$ 

12. y" + 4yy/ = —2y/(1+27), y(0)=0, y’(0)=1/2, y(x) =2/(1+2”) 
13. $y'' + (y')^2 = 1$, y'(0) = 0, y(1) = 2 

14. $y'' + yy' + (y')^2 = 0$, y(0) = 0, y(1) + y'(1) = 1, $y(x) = 1 - e^{-x}$ 

15. $y'' + 4xyy' = -2y^2$, y(0) + y'(0) = 1, y(1) = 1/2, $y(x) = 1/(1 + x^2)$ 
16. $yy'' = y' - 1$, y(1) = 0, y'(2) = 1 + ln 2 

17. Consider the nonlinear differential equation 


\[ 2x y'' + (y')^2 - 4y = 4x. \]

Use the shooting method to solve this differential equation subject to each of 
the following sets of boundary conditions. In each case, the exact solution is 
$y(x) = (x+1)^2$. How rapidly does the approximate solution converge toward 
the exact solution as a function of the number of subintervals? 
(a) yl)=4, yQ2)=9 (b) y(1)=4, y"(2) =6 
(c) ¥(=4, ¥(2)=6 (d) y)—y(1) =0, 9/2) =6 

18. Ramirez (Computational Methods for Process Simulation, Butterworths, Bos- 
ton, 1989) develops the following nonlinear boundary value problem for the con- 
centration, y(z), of the reactant in a second-order chemical reaction taking place 
within a tubular reactor with dispersion: 


dy - 0H —5y? =0 


dx? 
i dy dy 
10) 10 dz), _4 ~~ da lay ae 


Determine the concentration in increments of 0.01 along the length of the reactor. 


19. C. Philipsen, S. Markvorsen and W. Kleim [“Modelling the Stem Curve of a 
Palm in a Strong Wind,” SIAM Review, 38 (3), pp. 483-484, 1996] present the 
following model for the angle of the stem of a tall palm tree, relative to its vertical 
position, when the tree is subjected to wind loading: 


EIS = -W. (1- 7) sing — Wesin® — Dcos6. 
Here, @ is the angle of the stem relative to the vertical position, and s is arc 
length measured along the stem. The parameters in the model are the total 
stem {Weight W, = 22700 N, the Young’s modulus of the stem E = 0.9 x 10° 
N/m^2, the moment of inertia of the stem $I = 5.147 \times 10^{-4}$ m^4, the length of 
the stem L = 30 m, the total canopy weight $W_c$ = 1385.5 N, and the wind drag 
force on the canopy D = 1.2405U? N, where U is the wind speed in m/s. The 
boundary conditions imposed on the stem angle are 

0(0)=0 and 6(L)=0. 


Determine the function 6{s) when the wind speed is 8 meters/second. 


20. Rework the “Density-Dependent Dispersal of an Insect Population” problem for 
an environment with a nondimensional length of 5; that is, replace the boundary 
condition n'(3) = 0 with the condition n'(5) = 0. Produce a plot similar to 
Figure 8.15 to display the results. 

21. Subramanian and Balakotaiah [“Convective Instabilities Induced by Exothermic 
Reactions Occurring in a Porous Medium,” Phys. Fluids, 6 (9), pp. 2907-2922, 
1994] develop the boundary value problem 


a6 2 8 70 a 
6'(0)=0, @(1)=0 


for the steady-state temperature profile, $\theta(z)$, in a porous medium undergoing 
an exothermic reaction. The parameter B is the maximum possible temperature 
in the absence of natural convection, $\phi$ is the ratio of the characteristic time 
for conduction to that for heat generation and $\gamma$ is the dimensionless activation 
energy. For B = 6.0, $\phi^2 = 0.25$, and $\gamma = 30.0$, determine $\theta(z)$. 


22. For values of l ranging from 0.1 to 5.0 in increments of 0.1, solve 
n! +2/n(1— Vn) =0, n(0)=0, n'(l)=0. 


Use a step size of $\Delta x = 0.01$ for each l. Plot the maximum value of the population 
density as a function of l. 


CHAPTER 9 


Elliptic Partial Differential 
Equations 


AN OVERVIEW 
Fundamental Mathematical Problem 


In this chapter, we will develop the finite difference method for approximating the 
solution of elliptic partial differential equations. The second-order linear partial 
differential equation 

\[ A(x,y)\frac{\partial^2 u}{\partial x^2} + 2B(x,y)\frac{\partial^2 u}{\partial x\,\partial y} + C(x,y)\frac{\partial^2 u}{\partial y^2} + D(x,y)\frac{\partial u}{\partial x} + E(x,y)\frac{\partial u}{\partial y} + F(x,y)\,u = G(x,y) \]

is said to be elliptic over some region R in the x-y plane provided that 

\[ A(x,y)C(x,y) - [B(x,y)]^2 > 0 \]

for all $(x,y) \in R$. The inequality $A(x,y)C(x,y) - [B(x,y)]^2 > 0$ is called the 
ellipticity condition. 

Elliptic partial differential equations model time-independent problems and 
often arise when determining the steady-state solutions to time-dependent problems. 
Elliptic problems are also the multidimensional analogue of the boundary 
value problems we studied in Chapter 8. There are two special cases of the general 
elliptic equation that we will investigate in detail in this chapter. The first is the 
Poisson equation 

\[ \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} = f(x,y). \]

The second arises when $f(x,y) \equiv 0$: 

\[ \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} = 0. \]

This is the so-called Laplace equation. 

We will focus our studies primarily on problems specified over rectangular 
domains. The left-hand part of the following figure illustrates a typical rectangular 
domain: 

\[ R = \{ (x,y) \mid a < x < b,\ c < y < d \}. \]

The symbol $\partial R$ denotes the boundary of R, which, in this case, is simply the 
rectangle itself. Elliptic partial differential equations can also be specified over 
nonrectangular, or irregular, domains, such as the one pictured below at the right. With 
the exception of circular domains, however, finite differences generally do not work 
well on nonrectangular domains. We will therefore consider special cases only. 


As for boundary conditions, we will discuss the same three types that were 
treated in Chapter 8: 

  Dirichlet:  $u(x,y) = r(x,y)$ on $\partial R$ 
  Neumann:    $\frac{\partial u}{\partial n}(x,y) = r(x,y)$ on $\partial R$ 
  Robin:      $\alpha(x,y)\,u(x,y) + \beta(x,y)\,\frac{\partial u}{\partial n}(x,y) = r(x,y)$ on $\partial R$ 

In general, different types of boundary conditions may be specified on different 
portions of the boundary. In the Neumann and Robin boundary conditions, $\partial/\partial n$ 
refers to the derivative taken in the outward normal direction. The interpretation 
of this derivative for different domains is illustrated in Figure 9.1. 


Flow Through a Contraction Duct 


A fluid flows into the asymmetric contraction duct shown in Figure 9.2 with a 
uniform velocity of 3 meters/second and exits with a uniform velocity of 12 me- 
ters/second. The duct has a total length of three meters. The inlet section is one 
meter long and two meters high. Along the upper wall, the duct contracts at a 
45° angle over a distance of one-half meter, while, along the lower wall, the duct 
contracts at a 45° angle over a distance of one meter. We will assume that the duct 
is of sufficient depth to warrant treating the flow as two dimensional. 

Figure 9.1 Interpretation of outward normal derivative for different 
domain configurations. 

Figure 9.2 Geometry of an asymmetric contraction duct. (Figure labels: uniform 
inflow at 3 m/s, uniform outflow, 45° contraction angles along the upper and 
lower walls.) 

Figure 9.3 Representative fluid element for flow through the contraction 
duct of Figure 9.2. The mass flow rates across each face of the 
element are indicated. 

Let u(x, y) denote the flow velocity in the x-direction and v(x, y) denote the 
flow velocity in the y-direction. Consider the representative fluid element in 
Figure 9.3, which has length $\Delta x$, height $\Delta y$, and unit depth. According to the law 
of conservation of mass, the rate of change of fluid mass within this representative 
element must be equal to the difference between the rate at which mass enters and 
the rate at which mass exits the element. If $\rho$ denotes the mass density of the fluid, 
then the element contains a total mass of $\rho\,\Delta x\,\Delta y \cdot 1$. The rate at which fluid mass 
crosses each face of the element is indicated in Figure 9.3. Therefore, conservation 
of mass requires 

\[ \frac{\partial(\rho\,\Delta x\,\Delta y)}{\partial t} = \rho u\,\Delta y + \rho v\,\Delta x - \left(\rho u + \frac{\partial(\rho u)}{\partial x}\Delta x\right)\Delta y - \left(\rho v + \frac{\partial(\rho v)}{\partial y}\Delta y\right)\Delta x, \]

or, upon simplification, 

\[ \frac{\partial \rho}{\partial t} + \frac{\partial(\rho u)}{\partial x} + \frac{\partial(\rho v)}{\partial y} = 0. \]

This is known as the continuity equation. If we assume that the fluid is incompressible 
(i.e., that $\rho$ is constant), then the continuity equation becomes 

\[ \frac{\partial u}{\partial x} + \frac{\partial v}{\partial y} = 0. \tag{1} \]


At this point we introduce a clever mathematical device. Define the stream 
function, $\psi(x, y)$, such that 

\[ u = \frac{\partial \psi}{\partial y} \quad\text{and}\quad v = -\frac{\partial \psi}{\partial x}. \tag{2} \]

Then 

\[ \frac{\partial u}{\partial x} + \frac{\partial v}{\partial y} = \frac{\partial}{\partial x}\left(\frac{\partial \psi}{\partial y}\right) + \frac{\partial}{\partial y}\left(-\frac{\partial \psi}{\partial x}\right) = 0, \]

so equation (1) is satisfied identically. In addition to reducing the number of 
variables by one in such a way that conservation of mass is guaranteed, the stream 
function has an important geometric interpretation. Curves along which $\psi$ is 
constant are the streamlines of the flow, where a streamline is defined as a curve that 
is everywhere tangent to the flow field at a given instant. For the problem under 
consideration, in which the flow is time independent, streamlines also coincide with 
the actual paths traveled by the fluid particles. 
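So how do we determine $\psi$? The expression 

\[ \frac{1}{2}\left(\frac{\partial v}{\partial x} - \frac{\partial u}{\partial y}\right) \tag{3} \]

defines the angular velocity of the flow, which is denoted by $\omega$. Substituting (2) 
into (3) yields 

\[ \omega = \frac{1}{2}\left[\frac{\partial}{\partial x}\left(-\frac{\partial \psi}{\partial x}\right) - \frac{\partial}{\partial y}\left(\frac{\partial \psi}{\partial y}\right)\right] = -\frac{1}{2}\left(\frac{\partial^2 \psi}{\partial x^2} + \frac{\partial^2 \psi}{\partial y^2}\right). \]

We will now make one last assumption, which is that the flow is irrotational. For 
an irrotational flow, $\omega = 0$ and so 

\[ \frac{\partial^2 \psi}{\partial x^2} + \frac{\partial^2 \psi}{\partial y^2} = 0 \tag{4} \]

or, equivalently, $\nabla^2 \psi = 0$. 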

We still need to specify appropriate boundary conditions for $\psi$. First, let’s 
fix the origin of our coordinate system at the lower left corner of the duct. At the 
inlet, we know that u(0, y) = 3; therefore, 

\[ \frac{\partial \psi}{\partial y}(0,y) = 3 \quad\Rightarrow\quad \psi(0,y) = 3y, \tag{5} \]

where we’ve set the constant of integration to zero for convenience. Since the fluid 
cannot penetrate the walls of the duct, the fluid velocity normal to any wall must be 
zero. Using the definition of the stream function, it follows that the velocity normal 
to any surface is equal to the derivative of $\psi$ taken along that surface. This leads 
us to conclude that $\psi$ must be constant along the lower wall and along the upper 
wall of the duct. But $\psi(0,0) = 0$ and $\psi(0,2) = 6$, so the appropriate boundary 
conditions are 

\[ \psi(x,y) = 0 \ \text{along the lower wall} \tag{6} \]

and 

\[ \psi(x,y) = 6 \ \text{along the upper wall.} \tag{7} \]

Finally, at the exit of the duct, we find 

\[ \psi(3,y) = 12(y - 1). \tag{8} \]

Thus, to determine the flow through the contraction duct, we must solve the 
elliptic partial differential equation (4) subject to the Dirichlet boundary conditions 
(5)-(8). 


Remainder of Chapter 


We will begin our investigation of the numerical solution of elliptic partial differen- 
tial equations with the Poisson problem on rectangular domains subject to Dirichlet 
boundary conditions. Emphasis will be placed upon the derivation of the discrete 
analogue of the Poisson equation and the organization of the discrete equations into 
matrix form. More general boundary conditions are treated in Section 9.2. Next, 
iterative strategies for solving the discrete equations will be considered. Collec- 
tively, these techniques are referred to as relaxation schemes. In the fourth section, 
an analysis of the convergence properties of relaxation schemes will be undertaken, 
and the notion of multigrid methods will be introduced. Elliptic equations over ir- 
regular domains will be treated in Section 9.5. The focus will be placed on circular 
domains. 


Things to Come 
As noted above, the partial differential operator 

\[ A(x,y)\frac{\partial^2}{\partial x^2} + 2B(x,y)\frac{\partial^2}{\partial x\,\partial y} + C(x,y)\frac{\partial^2}{\partial y^2} + D(x,y)\frac{\partial}{\partial x} + E(x,y)\frac{\partial}{\partial y} + F(x,y) \]

is said to be elliptic when $A(x,y)C(x,y) - [B(x,y)]^2 > 0$ for all $(x,y) \in R$. When 
$A(x,y)C(x,y) - [B(x,y)]^2 = 0$ for all $(x,y) \in R$, the partial differential equation 
is said to be parabolic. Such problems are often time dependent and frequently 
arise when modeling diffusive processes. When $A(x,y)C(x,y) - [B(x,y)]^2 < 0$, the 
partial differential equation is said to be hyperbolic. These problems arise in the 
study of wavelike phenomena. The numerical solution of parabolic equations will 
be treated in Chapter 10 and the numerical solution of hyperbolic equations in 
Chapter 11. 
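
The three-way classification depends only on the sign of $AC - B^2$ at a point, which the following small sketch makes concrete. The function name `classify_pde` is an illustrative assumption, and the comparison with exactly zero is only meaningful for coefficients that are represented exactly.

```python
def classify_pde(a, b, c):
    """Classify A u_xx + 2B u_xy + C u_yy + ... at a point by the sign of AC - B^2."""
    discriminant = a * c - b * b
    if discriminant > 0:
        return "elliptic"
    if discriminant == 0:
        return "parabolic"
    return "hyperbolic"


# Coefficients (A, B, C) of three standard second-order model operators:
print(classify_pde(1, 0, 1))    # u_xx + u_yy
print(classify_pde(1, 0, 0))    # u_xx with only first derivatives in the other variable
print(classify_pde(1, 0, -1))   # u_xx - u_yy
```

For variable coefficients the function would be called with the values of A, B, and C at each point of interest, since the type of an equation can change across a region.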


9.1 THE POISSON EQUATION ON A RECTANGULAR DOMAIN, I: DIRICHLET 
BOUNDARY CONDITIONS 


We begin our study of the numerical solution of elliptic partial differential equations 
using the finite difference method with an investigation of the Poisson equation 

\[ \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} = f(x,y) \]

on the rectangular domain 

\[ R = \{ (x,y) \mid a < x < b,\ c < y < d \} \]

subject to the Dirichlet boundary conditions u(x,y) = g(x,y) for all $(x,y) \in \partial R$. 
Emphasis will be placed upon the derivation of the discrete analogue to the Poisson 
equation and the organization of the discrete equations into matrix form. 
Non-Dirichlet boundary conditions will be treated in Section 9.2. 

Figure 9.4 Computational grid for rectangular domain. 


Discrete Analogue to the Poisson Equation 


The first step in determining the finite difference approximation to the solution of 
the Poisson problem is to introduce a computational grid. Divide the intervals [a, b] 
and [c, d] into N and M equal-sized subintervals, respectively, where N and M are 
positive integers. For convenience, we will assume that 

\[ \frac{b-a}{N} = \frac{d-c}{M} = h. \]

Note that this last condition imposes a restriction on the domain. In particular, 
the aspect ratio of the rectangle (i.e., the ratio of the height to the length) must be 
a rational number. We will remove this restriction in the exercises. 

With the number of subintervals in each coordinate direction selected, the 
computational grid then consists of those points $(x_j, y_k)$, where $x_j = a + jh$ and 
$y_k = c + kh$ for j = 0, 1, 2, ..., N and k = 0, 1, 2, ..., M (see Figure 9.4). To 
simplify the expressions for the discrete equations, we will adopt our usual subscript 
notation for function values: 

\[ u_{j,k} = u(x_j, y_k), \qquad f_{j,k} = f(x_j, y_k), \qquad g_{j,k} = g(x_j, y_k). \]


With the computational grid defined, next evaluate the governing partial 
differential equation at each interior point (i.e., those for which j = 1, 2, 3, ..., N-1 
and k = 1, 2, 3, ..., M-1), 

\[ \left.\frac{\partial^2 u}{\partial x^2}\right|_{(x_j,y_k)} + \left.\frac{\partial^2 u}{\partial y^2}\right|_{(x_j,y_k)} = f(x_j, y_k), \]

and substitute the following second-order central difference approximations for the 
partial derivatives: 

\[ \left.\frac{\partial^2 u}{\partial x^2}\right|_{(x_j,y_k)} = \frac{u_{j-1,k} - 2u_{j,k} + u_{j+1,k}}{h^2} + O(h^2) \]

\[ \left.\frac{\partial^2 u}{\partial y^2}\right|_{(x_j,y_k)} = \frac{u_{j,k-1} - 2u_{j,k} + u_{j,k+1}}{h^2} + O(h^2). \]

This yields the equation 

\[ \frac{u_{j-1,k} - 2u_{j,k} + u_{j+1,k}}{h^2} + O(h^2) + \frac{u_{j,k-1} - 2u_{j,k} + u_{j,k+1}}{h^2} + O(h^2) = f_{j,k}. \]

Now, drop the truncation error terms and replace each $u_{j,k}$ (value of the exact 
solution) by $w_{j,k}$ (approximate value). Finally, multiply through by $-h^2$ and collect 
like terms to obtain 

\[ -w_{j-1,k} - w_{j+1,k} - w_{j,k-1} - w_{j,k+1} + 4w_{j,k} = -h^2 f_{j,k}. \tag{1} \]


For j = 1, j = N-1, k = 1 or k = M-1, at least one of the first four terms 
in equation (1) will be associated with a boundary grid point. Since Dirichlet 
boundary conditions have been specified, we can set these w values equal to the 
value of the function g(x, y) at the corresponding grid point. This provides us with 
the set of boundary condition equations 

\[ w_{j,k} = g_{j,k} \qquad (j = 0,\ j = N,\ k = 0,\ k = M). \tag{2} \]

Combining equation (1) for each interior grid point with equation (2) for each 
boundary grid point produces the discrete analogue of the Poisson problem subject 
to Dirichlet boundary conditions: 

\[ -w_{j-1,k} - w_{j+1,k} - w_{j,k-1} - w_{j,k+1} + 4w_{j,k} = -h^2 f_{j,k} \qquad (1 \le j \le N-1,\ 1 \le k \le M-1) \]

\[ w_{j,k} = g_{j,k} \qquad (j = 0,\ j = N,\ k = 0,\ k = M) \]


Since we arrived at this finite difference approximation by dropping second-order 
truncation error terms, we expect that this numerical method will be second or- 
der. We will verify this fact experimentally in both the worked examples and the 
exercises. 
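
As a quick numerical illustration of the second-order claim, the sketch below applies the five-point difference quotient to a smooth test function whose Laplacian is known and compares the error at two mesh sizes. It checks only the local truncation error of the difference formula, not the error of a full Poisson solve; the test function, evaluation point, and function names are assumptions chosen for the illustration.

```python
import math


def five_point_laplacian(u, x, y, h):
    """Five-point difference approximation to u_xx + u_yy at (x, y) with spacing h."""
    return (u(x - h, y) + u(x + h, y) + u(x, y - h) + u(x, y + h)
            - 4.0 * u(x, y)) / (h * h)


def u_test(x, y):
    # Smooth test function with Laplacian -2 sin(x) sin(y).
    return math.sin(x) * math.sin(y)


def laplacian_exact(x, y):
    return -2.0 * math.sin(x) * math.sin(y)


def truncation_error(h, x=0.5, y=0.7):
    return abs(five_point_laplacian(u_test, x, y, h) - laplacian_exact(x, y))


e_coarse = truncation_error(0.1)
e_fine = truncation_error(0.05)
print(e_coarse, e_fine, e_coarse / e_fine)
```

Halving h should reduce the error by a factor close to $2^2 = 4$, which is the signature of an $O(h^2)$ approximation.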

As in the case of one-dimensional boundary value problems, it is useful to 
think of equation (1) as a computational template which is to be applied at each 
grid point where the value of the approximate solution is unknown. Given the 
two-dimensional geometry of the problem, it is perhaps best to visualize this template 
as in Figure 9.5. Since the template uses values of the approximate solution from 
five neighboring points within the grid, it is often referred to as the five-point star 
approximation to the Laplacian operator 

\[ \frac{\partial^2}{\partial x^2} + \frac{\partial^2}{\partial y^2}. \]

Figure 9.5 Five-point star approximation to the Laplacian operator. 

Organizing the Equations 


The system of algebraic equations embodied by equations (1) and (2) is linear in 
the unknowns; hence, the system can be written in matrix form. What is the 
structure of the coefficient matrix for the system? The answer to this question 
depends heavily upon the numbering scheme used to order the unknowns. 

One of the most common numbering schemes is lexicographic ordering. Here, 
we start at the bottom left corner of the computational grid and number the un- 
knowns consecutively from left to right and from bottom to top. For example, 
suppose the problem domain is a square and we take N = M = 4, as shown 
in Figure 9.6. The number written next to the location of each unknown is the 
lexicographic number associated with that unknown. 

Placing the five-point star “over” the first unknown produces the equation

\[ -w_{1,0} - w_{0,1} + 4w_{1,1} - w_{2,1} - w_{1,2} = -h^2 f_{1,1}. \]

The first two terms, $w_{1,0}$ and $w_{0,1}$, are associated with boundary nodes,
so these values are given by the boundary conditions. Substituting
$w_{1,0} = g_{1,0}$ and $w_{0,1} = g_{0,1}$ and rearranging so that only terms
involving an unknown appear on the left-hand side, we arrive at

\[ 4w_{1,1} - w_{2,1} - w_{1,2} = -h^2 f_{1,1} + g_{1,0} + g_{0,1}. \]

Moving the five-point star to the second unknown, we obtain

\[ -w_{2,0} - w_{1,1} + 4w_{2,1} - w_{3,1} - w_{2,2} = -h^2 f_{2,1} \]

or, since $w_{2,0} = g_{2,0}$ from the boundary conditions,

\[ -w_{1,1} + 4w_{2,1} - w_{3,1} - w_{2,2} = -h^2 f_{2,1} + g_{2,0}. \]

Figure 9.6 5 x 5 grid placed over a square domain. The unknowns
are numbered in lexicographic order. (Legend: one symbol marks values known
from the boundary condition; the other marks unknown values to be approximated
by the numerical method.)


Continuing to move the five-point star from grid point to grid point, we de- 
velop one equation at a time, until, after placing the template over the last un- 
known, we arrive at the system of nine equations shown in Figure 9.7. Notice how 
the coefficient matrix is organized into blocks, with each block associated with the 
unknowns along one row of the computational grid. Overall, we say that A is a block 
tridiagonal matrix since the non-zero entries are clustered into the blocks along the 
main diagonal and in the primary sub- and superdiagonals. For this problem, we 
see that whether we move across or down the coefficient matrix, there are three 
blocks. This comes about because the computational grid (recall Figure 9.6) has 
three rows of unknowns. Further, each block is square with side dimension equal
to 3, the number of unknowns along each row. The right-hand side vector contains
$-h^2$ times the values of the function f, the nonhomogeneous term from the partial
differential equation, plus the values of the boundary value function, g, when the
grid point is adjacent to the boundary.
Figure 9.7 Organization of the discrete Poisson equations, with
Dirichlet boundary conditions, for a 5 x 5 grid using lexicographic
ordering of the unknowns. The vector of unknowns is

\[ \mathbf{w} = [\, w_{1,1}\ \ w_{2,1}\ \ w_{3,1} \mid w_{1,2}\ \ w_{2,2}\ \ w_{3,2} \mid w_{1,3}\ \ w_{2,3}\ \ w_{3,3} \,]^T. \]


For a more general grid, containing N subdivisions along the horizontal di- 
rection and M subdivisions along the vertical direction, the coefficient matrix will 
have dimension (N — 1)(M — 1) x (N—1)(M — 1). The structure of the matrix 
remains block tridiagonal. In particular, 


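\[
A = \begin{bmatrix}
 D & -I &        &        &    \\
-I &  D & -I     &        &    \\
   & \ddots & \ddots & \ddots &    \\
   &        & -I     &  D     & -I \\
   &        &        & -I     &  D
\end{bmatrix} \tag{3}
\]

with

\[
D = \begin{bmatrix}
 4 & -1 &        &        &    \\
-1 &  4 & -1     &        &    \\
   & \ddots & \ddots & \ddots &    \\
   &        & -1     &  4     & -1 \\
   &        &        & -1     &  4
\end{bmatrix}.
\]

There are M - 1 blocks both across and down the coefficient matrix, and each
submatrix, D and I, has dimension (N - 1) x (N - 1).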
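The block structure described above can be checked by assembling the matrix directly from the five-point star. The following is a minimal sketch in plain Python; the function name `poisson_matrix` and the zero-based `index` helper are illustrative choices, not notation from the text.

```python
def poisson_matrix(N, M):
    """Coefficient matrix of the five-point star with Dirichlet data.

    The unknowns w[j][k], 1 <= j <= N-1, 1 <= k <= M-1, are numbered
    lexicographically: left to right, then bottom to top.
    """
    nx, ny = N - 1, M - 1            # unknowns per row, number of rows
    n = nx * ny
    A = [[0] * n for _ in range(n)]

    def index(j, k):
        # zero-based lexicographic number of the unknown at grid point (j, k)
        return (k - 1) * nx + (j - 1)

    for k in range(1, M):
        for j in range(1, N):
            row = index(j, k)
            A[row][row] = 4
            # Neighbors that are themselves unknowns contribute -1;
            # boundary neighbors are moved to the right-hand side instead.
            for jj, kk in ((j - 1, k), (j + 1, k), (j, k - 1), (j, k + 1)):
                if 1 <= jj <= nx and 1 <= kk <= ny:
                    A[row][index(jj, kk)] = -1
    return A


A = poisson_matrix(4, 4)             # the 5 x 5 grid of Figure 9.6
for row in A:
    print(" ".join(f"{a:2d}" for a in row))
```

Printing the 9 x 9 matrix for N = M = 4 makes the three-by-three arrangement of 3 x 3 blocks visible: tridiagonal blocks on the diagonal and negative identity blocks beside them.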


Solving the Equations 


Let’s start by establishing that the coefficient matrix, A, for the finite difference 
equations, as given by equation (3), is nonsingular. On each row of A, the diagonal 
element is equal to 4 and there are at most four additional non-zero elements, each 
with magnitude equal to 1. Consequently, A is diagonally dominant. Furthermore, 
certain rows of A, specifically those for which at least one point of the compu- 
tational template falls along the boundary of the domain, are strictly diagonally 
dominant. These conditions, however, are not sufficient to guarantee that a matrix 
is nonsingular (see Exercise 2).
Fortunately, A has another important property—the matrix is irreducible. 


Definition. The n x n matrix A is REDUCIBLE if there exists a permutation
matrix P such that

\[ P A P^T = \begin{bmatrix} A_{1,1} & A_{1,2} \\ 0 & A_{2,2} \end{bmatrix}, \]

where $A_{1,1}$ is an $r \times r$ matrix, $A_{1,2}$ is an $r \times (n-r)$ matrix
and $A_{2,2}$ is an $(n-r) \times (n-r)$ matrix for some positive integer r. Note
that the submatrices along the diagonal are square. A matrix is IRREDUCIBLE if
no such permutation matrix exists.
Essentially, if the coefficient matrix of a system of equations is reducible, the
solution may be found by breaking the original system into two smaller systems.
When the coefficient matrix is irreducible, no such reduction in problem size is
possible. It can be shown that a matrix which is irreducible, diagonally dominant
and strictly diagonally dominant in at least one row is nonsingular (see Horn and
Johnson [1]). Thus, A is nonsingular, and for any nonhomogeneous term f and any
Dirichlet boundary data g, the finite difference equations for the Poisson problem
on a rectangular domain have a unique solution.
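The three hypotheses of this result can be tested numerically for a given grid. The sketch below is illustrative only: it relies on the standard characterization that a matrix is irreducible exactly when its directed graph (an edge from i to c whenever the (i, c) entry is nonzero) is strongly connected, and all function names are assumptions made for this example.

```python
from collections import deque


def poisson_matrix(N, M):
    """Five-point-star matrix, Dirichlet data, lexicographic ordering."""
    nx, ny = N - 1, M - 1
    n = nx * ny
    A = [[0] * n for _ in range(n)]
    for k in range(ny):
        for j in range(nx):
            r = k * nx + j
            A[r][r] = 4
            if j > 0:
                A[r][r - 1] = -1       # left neighbor
            if j < nx - 1:
                A[r][r + 1] = -1       # right neighbor
            if k > 0:
                A[r][r - nx] = -1      # neighbor below
            if k < ny - 1:
                A[r][r + nx] = -1      # neighbor above
    return A


def diagonal_dominance(A):
    """Return (dominant in every row, strictly dominant in some row)."""
    weak, strict = True, False
    for i, row in enumerate(A):
        off = sum(abs(a) for c, a in enumerate(row) if c != i)
        weak = weak and abs(row[i]) >= off
        strict = strict or abs(row[i]) > off
    return weak, strict


def reaches_all(A, transpose=False):
    """Breadth-first search from node 0 in the directed graph of A."""
    n = len(A)
    seen = {0}
    queue = deque([0])
    while queue:
        i = queue.popleft()
        for c in range(n):
            entry = A[c][i] if transpose else A[i][c]
            if c != i and entry != 0 and c not in seen:
                seen.add(c)
                queue.append(c)
    return len(seen) == n


def is_irreducible(A):
    # Strongly connected: node 0 reaches every node, and every node reaches node 0.
    return reaches_all(A) and reaches_all(A, transpose=True)


A = poisson_matrix(4, 4)
print(diagonal_dominance(A), is_irreducible(A))
```

For comparison, the upper triangular matrix [[2, 1], [0, 1]] is already in the reducible block form of the definition, and `is_irreducible` rejects it.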

Having established that the finite difference equations have a unique solution, 
how do we go about computing that solution? When N and M are both small, it is 
feasible in terms of both storage and computational cost to solve the finite difference 
equations using direct methods. For example, we can show that A is not only 
nonsingular, it is also symmetric positive definite (Exercise 1). Therefore, we might 
compute a Cholesky factorization or an $LDL^T$ factorization of A. Alternatively,
we can exploit the block tridiagonal structure of the matrix. See Varga [2] and 
Lindzen and Kuo [3] for specific algorithms. 
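As a small illustration of the direct approach, the sketch below factors the 9 x 9 matrix for N = M = 4 as $A = LL^T$. It uses a textbook dense Cholesky loop rather than any of the specialized block algorithms cited above, and the helper names are assumptions made for this example.

```python
import math


def poisson_matrix(N, M):
    """Five-point-star matrix, Dirichlet data, lexicographic ordering."""
    nx, ny = N - 1, M - 1
    n = nx * ny
    A = [[0.0] * n for _ in range(n)]
    for k in range(ny):
        for j in range(nx):
            r = k * nx + j
            A[r][r] = 4.0
            if j > 0:
                A[r][r - 1] = -1.0
            if j < nx - 1:
                A[r][r + 1] = -1.0
            if k > 0:
                A[r][r - nx] = -1.0
            if k < ny - 1:
                A[r][r + nx] = -1.0
    return A


def cholesky(A):
    """Return lower triangular L with A = L L^T; fail if a pivot is not positive."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                d = A[i][i] - s
                if d <= 0.0:
                    raise ValueError("matrix is not positive definite")
                L[i][j] = math.sqrt(d)
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return L


A = poisson_matrix(4, 4)
L = cholesky(A)

# Largest entry of A - L L^T, as a check on the factorization.
n = len(A)
residual = max(
    abs(A[i][j] - sum(L[i][k] * L[j][k] for k in range(n)))
    for i in range(n) for j in range(n)
)
print(f"max |A - L L^T| = {residual:.2e}")
```

The factorization completing without a nonpositive pivot is consistent with A being symmetric positive definite. The factor L also stays within the band of A, which is part of why direct methods remain affordable on small grids.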

In practice, however, neither N nor M will be small. When N and M are 
even of moderate size, it is recommended that the finite difference equations be 
solved using iterative methods. We will explore this issue in detail in Sections 9.3 
and 9.4. 


A Worked Example 


EXAMPLE 9.1 A Sample Problem 


To demonstrate the application of the finite difference method to the Poisson
problem, consider

\[ \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} = -[2 + \pi^2 x(1 - x)]\cos(\pi y) \]

on the domain $R = \{(x,y) \mid 0 < x < 1,\ 0 < y < 1/2\}$. The Dirichlet boundary
conditions are given by

\[ u(0,y) = u(1,y) = 0, \qquad u(x,0) = x(1 - x), \qquad u(x,1/2) = 0. \]

Note that $f(x,y) = -[2 + \pi^2 x(1 - x)]\cos(\pi y)$ and g is a piecewise function
defined for $(x,y) \in \partial R$ by

\[
g(x,y) = \begin{cases}
0, & x = 0 \\
0, & x = 1 \\
x(1 - x), & y = 0 \\
0, & y = 1/2.
\end{cases}
\]

Since the aspect ratio (the ratio of the height to the length) of R is one-half,
we must choose N = 2M to achieve the same grid spacing in both directions. Let’s
start with M = 2 and N = 4. This implies that the mesh spacing will be h = 1/4.
The computational grid, with the unknowns numbered in lexicographic order, is
shown below.

[Figure: computational grid for N = 4, M = 2. Legend: one symbol marks values
known from the boundary condition; the other marks unknown values to be
approximated by the numerical method.]

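Successively placing the five-point star at the first, second, and third
unknowns produces the system of equations

\[
\begin{aligned}
4w_{1,1} - w_{2,1} &= -h^2 f_{1,1} + g_{1,0} + g_{0,1} + g_{1,2} \\
-w_{1,1} + 4w_{2,1} - w_{3,1} &= -h^2 f_{2,1} + g_{2,0} + g_{2,2} \\
-w_{2,1} + 4w_{3,1} &= -h^2 f_{3,1} + g_{3,0} + g_{4,1} + g_{3,2},
\end{aligned}
\]

where

\[
\begin{aligned}
f_{1,1} &= -[2 + \pi^2 \cdot 0.25(0.75)]\cos(0.25\pi) \\
f_{2,1} &= -[2 + \pi^2 \cdot 0.5(0.5)]\cos(0.25\pi) \\
f_{3,1} &= -[2 + \pi^2 \cdot 0.75(0.25)]\cos(0.25\pi)
\end{aligned}
\]

and

\[
\begin{aligned}
g_{1,0} &= 0.25(0.75), & g_{0,1} &= g_{1,2} = 0 \\
g_{2,0} &= 0.5(0.5), & g_{2,2} &= 0 \\
g_{3,0} &= 0.75(0.25), & g_{4,1} &= g_{3,2} = 0.
\end{aligned}
\]

Evaluating the entries of the right-hand side vector, the system can be written in
matrix form as

\[
\begin{bmatrix} 4 & -1 & 0 \\ -1 & 4 & -1 \\ 0 & -1 & 4 \end{bmatrix}
\begin{bmatrix} w_{1,1} \\ w_{2,1} \\ w_{3,1} \end{bmatrix}
=
\begin{bmatrix} 0.357672 \\ 0.447433 \\ 0.357672 \end{bmatrix}.
\]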


The entries in the right-hand side vector have been displayed to six decimal places 
for convenience. 

The solution to this system is listed in the third column of the following table.
The values of the exact solution, $u(x,y) = x(1 - x)\cos(\pi y)$, at the corresponding
grid points are listed in the fourth column. Notwithstanding the crudeness of the
grid, the approximate solution is in error by only slightly more than 1%.

x_j     y_k     w_{j,k}     u_{j,k}     Absolute Error
0.25    0.25    0.134151    0.132583    0.001568
0.50    0.25    0.178934    0.176777    0.002157
0.75    0.25    0.134151    0.132583    0.001568
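The 3 x 3 system of this example is small enough to reproduce with a few lines of code. The sketch below uses a generic Gaussian elimination routine (`solve` is a helper written for this illustration, not a routine from the text) and compares the result with the exact solution at the three interior grid points.

```python
import math


def solve(A, b):
    """Gaussian elimination with partial pivoting; A and b are not modified."""
    n = len(b)
    T = [list(row) + [bi] for row, bi in zip(A, b)]   # augmented matrix
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(T[r][c]))
        T[c], T[p] = T[p], T[c]
        for r in range(c + 1, n):
            m = T[r][c] / T[c][c]
            for q in range(c, n + 1):
                T[r][q] -= m * T[c][q]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(T[r][q] * x[q] for q in range(r + 1, n))
        x[r] = (T[r][n] - s) / T[r][r]
    return x


def f(x, y):
    return -(2 + math.pi ** 2 * x * (1 - x)) * math.cos(math.pi * y)


def exact(x, y):
    return x * (1 - x) * math.cos(math.pi * y)


h = 0.25
xs = [0.25, 0.50, 0.75]
y = 0.25

A = [[4, -1, 0], [-1, 4, -1], [0, -1, 4]]
# Only the bottom edge carries nonzero boundary data: g(x, 0) = x(1 - x).
b = [-h * h * f(x, y) + x * (1 - x) for x in xs]
w = solve(A, b)

for x, wj in zip(xs, w):
    u = exact(x, y)
    print(f"{x:.2f}  {y:.2f}  {wj:.6f}  {u:.6f}  {abs(wj - u):.6f}")
```

The printed columns follow the layout of the table above: grid coordinates, approximate value, exact value, and absolute error.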


To investigate the order of convergence of the numerical method, the sample
problem was solved on successively finer grids. The root mean square (rms) error,

\[ \sqrt{ \frac{ \sum_{j=1}^{N-1} \sum_{k=1}^{M-1} (u_{j,k} - w_{j,k})^2 }{ (N-1)(M-1) } }, \]

and the ratio of the rms error to $h^2$ were calculated for each grid. Results are
summarized in the following table. The values in the third column of this table
suggest that the approximation error is roughly a constant times $h^2$; that is,
the approximation error is $O(h^2)$. Hence, we have numerical verification that
the method is second-order accurate.

Step Size, h    rms Error          rms Error / h^2
1/4             1.786739 x 10^-3   0.02859
1/6             6.716379 x 10^-4   0.02418
1/8             3.493104 x 10^-4   0.02236
1/10            2.137212 x 10^-4   0.02137
1/12            1.441616 x 10^-4   0.02076
1/14            1.037838 x 10^-4   0.02034
1/16            7.827723 x 10^-5   0.02004
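A convergence study of this kind can be sketched as follows. The code assembles the discrete equations for the sample problem on a grid with N = 2M, solves them with a dense Gaussian elimination helper (adequate for these small grids, though not how one would treat a fine grid), and reports the rms error and its ratio to $h^2$. The names `solve` and `rms_error` are illustrative.

```python
import math


def solve(A, b):
    """Gaussian elimination with partial pivoting; A and b are not modified."""
    n = len(b)
    T = [list(row) + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(T[r][c]))
        T[c], T[p] = T[p], T[c]
        for r in range(c + 1, n):
            m = T[r][c] / T[c][c]
            for q in range(c, n + 1):
                T[r][q] -= m * T[c][q]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(T[r][q] * x[q] for q in range(r + 1, n))
        x[r] = (T[r][n] - s) / T[r][r]
    return x


def f(x, y):
    return -(2 + math.pi ** 2 * x * (1 - x)) * math.cos(math.pi * y)


def exact(x, y):
    return x * (1 - x) * math.cos(math.pi * y)


def rms_error(N):
    """rms error of the five-point scheme for the sample problem, with M = N/2."""
    M = N // 2
    h = 1.0 / N
    nx, ny = N - 1, M - 1
    n = nx * ny
    A = [[0.0] * n for _ in range(n)]
    b = [0.0] * n
    for k in range(1, M):
        for j in range(1, N):
            row = (k - 1) * nx + (j - 1)
            A[row][row] = 4.0
            b[row] = -h * h * f(j * h, k * h)
            for jj, kk in ((j - 1, k), (j + 1, k), (j, k - 1), (j, k + 1)):
                if 1 <= jj <= nx and 1 <= kk <= ny:
                    A[row][(kk - 1) * nx + (jj - 1)] = -1.0
                elif kk == 0:
                    # bottom edge: g(x, 0) = x(1 - x); g vanishes on the other edges
                    b[row] += (jj * h) * (1 - jj * h)
    w = solve(A, b)
    total = sum(
        (exact(j * h, k * h) - w[(k - 1) * nx + (j - 1)]) ** 2
        for k in range(1, M) for j in range(1, N)
    )
    return math.sqrt(total / n)


for N in (4, 8, 16):
    e = rms_error(N)
    print(f"h = 1/{N:<2d}  rms = {e:.6e}  rms/h^2 = {e * N * N:.5f}")
```

A roughly constant last column as h decreases is the numerical signature of second-order accuracy; the N = 4 line corresponds to the first row of the table above.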
EXERCISES


1. Let A be the matrix given in equation (3).
(a) Using the Gerschgorin Circle Theorem, show that $0 \le \lambda \le 8$, where
$\lambda$ is any eigenvalue of A.
(b) Use the fact that A is nonsingular to conclude that all of the eigenvalues of
A are positive.
(c) Use Exercise 20 of Section 4.1 to conclude that A is symmetric positive
definite.
2. Consider the matrix


1 
1 
1 


OHFon 
o | 
me 

conor 


—2 
(a) Show that M is diagonally dominant with at least one row that is strictly 
diagonally dominant. 
(b) Show that M is singular. 
(c) Show that M is reducible. 
3. (a) Let A(x) be any antiderivative of a(x). Show that the change of variable 
wx,y) = eA@ay(z, y) transforms the equation 
Bu Ou Ou ew Ow A 
Seg ew OU eS F gw, gw _ Ae) 
a2 + dy? +a(z)a- f(z,y) into a? + ay =e F(a, y). 
(b) Let B(y) be any antiderivative of b(y). Show that the change of variable 
w(x, y) = e® u(x, y) transforms the equation 


Ou Ou Ou : ew Aw Bly) 
See oe os (x,y) into one f 


(c) What change of variable will transform 


Ou Ou Ou Ou 
into 
Ow Pw 


eae ee F 2 
Ox2 + Bye f(z, y)? 
How is f(z,y) related to f(z, y)? 


In Exercises 4-6: 


(a) Using second-order central differences to approximate all partial derivatives,
construct the computational template corresponding to the indicated elliptic
partial differential equation. Assume uniform spacing, h, in both coordinate
directions.

(b) Assuming a rectangular domain and Dirichlet boundary conditions, specify
sufficient conditions for the finite difference equations to have a unique
solution.


& 2 
4. 53 + Soa tt, yulzy) = fey) 

au eu Ou ou 
5. ant + By? + a(t, y)a- + (ana, = f(x,y) 
6. 


a? au 

c1(2,¥) 5-5 a 2en(2,¥) 552 a cat. WA 
a a 

+ calm yar + clea) a + colt, uu = fay) 
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7. Consider the Poisson problem 


au Ou 
pn py eee LO eT 
i y y 
u(x, 0 =0, 1) = >— 0, = 7 ) =a 
(,0)=0, wa0= — oa, WO = pha, ww) =H, 
whose exact solution is y 
Wew)= ta gea ye 


(a) Taking N = M = 4, set up and solve the corresponding system of finite 
difference equations. 


(b) Numerically verify the second-order accuracy of the numerical method. 
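8. Repeat Exercise 7 (but use N = M = 3 in part (a)) for the Poisson problem

\[ \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} = x^2 + y^2 \quad \text{on } R = \{(x,y) \mid 0 < x < 1,\ 0 < y < 1\} \]

\[ u(x,0) = 0, \quad u(x,1) = \tfrac{1}{2}x^2, \quad u(0,y) = \sin(\pi y), \quad u(1,y) = e^{\pi}\sin(\pi y) + \tfrac{1}{2}y^2. \]

The exact solution for this problem is $u(x,y) = e^{\pi x}\sin(\pi y) + \tfrac{1}{2}(xy)^2$.

9. Repeat Exercise 7 for the Poisson problem

\[ \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} = -52\cos(4x + 6y) \quad \text{on } R = \{(x,y) \mid 0 < x < 1,\ 0 < y < 1\} \]

\[ u(x,0) = \cos(4x), \quad u(x,1) = \cos(4x + 6), \quad u(0,y) = \cos(6y), \quad u(1,y) = \cos(6y + 4). \]

The exact solution for this problem is $u(x,y) = \cos(4x + 6y)$.

10. Repeat Exercise 7 for the Poisson problem

\[ \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} = 0 \quad \text{on } R = \{(x,y) \mid 1 < x < 2,\ 0 < y < 1\} \]

\[ u(x,0) = 2\ln x, \quad u(x,1) = \ln(x^2 + 1), \quad u(1,y) = \ln(y^2 + 1), \quad u(2,y) = \ln(y^2 + 4). \]

The exact solution for this problem is $u(x,y) = \ln(x^2 + y^2)$.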


11. (a) Over a computational grid with a spacing of $\Delta x$ in the x-direction and
$\Delta y$ in the y-direction (with $\Delta x$ not necessarily equal to $\Delta y$),
construct the computational template corresponding to the Poisson problem. Use
second-order central differences to approximate all partial derivatives.

(b) Assuming a rectangular domain and Dirichlet boundary conditions, con- 
struct the coefficient matrix for the system of finite difference equations 
corresponding to the template from part (a). 

(c) Under what conditions is the coefficient matrix from part (b) guaranteed to 
be nonsingular? 
Figure 9.8 Red-black ordering of unknowns on a 5 x 5 grid placed
over a square domain. (Legend: values known from the boundary condition; RED
unknown values and BLACK unknown values, both to be approximated by the
numerical method.)


12. Using the results of Exercise 11, 
(a) Repeat Exercise 7(a) with N =5 and M =3. 
(b) Repeat Exercise 8(a) with N =4 and M =5. 
(c) Repeat Exercise 9(a) with N =3 and M =5. 
(d) Repeat Exercise 10(a) with N = 5 and M = 4. 


13. A second popular numbering scheme for the unknowns is the red-black ordering.
In this scheme, the unknowns are divided into two categories: red unknowns
and black unknowns. An unknown, $w_{j,k}$, is considered red whenever j + k is
even and is considered black whenever j + k is odd. The red unknowns are
numbered first, in lexicographic order, followed by the black unknowns, also in
lexicographic order. The red-black ordering of the unknowns from Figure 9.6 is
shown in Figure 9.8.

(a) Write out the system of equations corresponding to the computational grid 
of Figure 9.8. 

(b) What type of structure can you identify in the coefficient matrix? Generalize 
this structure to a grid with N subdivisions in the horizontal direction and 
M subdivisions in the vertical direction. 


9.2 THE POISSON EQUATION ON A RECTANGULAR DOMAIN, II:
NON-DIRICHLET BOUNDARY CONDITIONS


In this section we will examine how to handle non-Dirichlet boundary conditions 
for the Poisson equation on rectangular domains. Unlike one-dimensional boundary 
value problems, where the boundary conditions affect only the first and last row of 
the coefficient matrix, when working in two dimensions, the boundary conditions 
can affect the coefficient matrix in a variety of ways. There are too many
combinations of different types of boundary conditions specified over different
portions of the boundary to cover them all. Instead, we will discuss two specific
cases in detail. Other cases will be treated in the exercises.


Neumann Boundary Condition along Bottom Edge 


Suppose we are attempting to solve the Poisson equation on a rectangular domain
with the Dirichlet boundary condition u(x,y) = g(x,y) specified along the sides
and the top of the rectangle, but with the Neumann condition
$\partial u/\partial n = \alpha(x)$ along the bottom. Along the bottom of the
rectangle the outward normal vector points in the negative y direction, so
$\partial/\partial n = -\partial/\partial y$. Hence, in terms of the coordinate
directions, the boundary condition along the bottom of the rectangle reads
$-\partial u/\partial y = \alpha(x)$.

The Neumann boundary condition along the bottom of the rectangle produces
an extra row of unknowns in the computational grid. To handle these unknowns, we
have exactly the same three options open to us as when we were working with
one-dimensional boundary value problems. First, we could derive a new template
using a second-order forward difference to approximate
$\partial^2 u/\partial y^2$ for application along the lower boundary. This,
however, would destroy the block tridiagonal structure of the coefficient matrix.
To maintain the matrix structure, we could use a first-order difference formula,
but this would lower the overall order of approximation. This leaves us with the
third alternative: the use of fictitious nodes. This approach maintains both the
structure of the coefficient matrix (block tridiagonal, each block tridiagonal)
and the order of approximation.

What effect will this have on the structure of the coefficient matrix? Let’s 
consider a 5 x 5 grid placed over a square domain—see Figure 9.9. With the 
Neumann condition along the bottom boundary, there are four rows of unknowns 
with three unknowns per row (compare this with the grid shown in Figure 9.6 of 
Section 9.1). Therefore, the coefficient matrix will be 12 x 12, organized as four 
blocks across and down, each block being 3 x 3. 

Taking the standard five-point star template and moving it along the first row
of unknowns produces the set of equations

\[
\begin{aligned}
-w_{F1} + 4w_{1,0} - w_{2,0} - w_{1,1} &= -h^2 f_{1,0} + g_{0,0} \\
-w_{F2} - w_{1,0} + 4w_{2,0} - w_{3,0} - w_{2,1} &= -h^2 f_{2,0} \\
-w_{F3} - w_{2,0} + 4w_{3,0} - w_{3,1} &= -h^2 f_{3,0} + g_{4,0}.
\end{aligned} \tag{1}
\]

To eliminate the fictitious nodes from these equations, we turn to the Neumann
boundary condition. Evaluating the boundary condition at $(x_1, 0)$ yields

\[ -\left.\frac{\partial u}{\partial y}\right|_{(x_1,0)} = \alpha(x_1). \]

Replacing the derivative with a second-order central difference formula, we find

\[ -\frac{w_{1,1} - w_{F1}}{2h} = \alpha(x_1) \quad\Rightarrow\quad w_{F1} = w_{1,1} + 2h\alpha(x_1). \]
Figure 9.9 Computational grid for Poisson equation on a square domain
with a Neumann boundary condition along the bottom side and
Dirichlet boundary conditions on the other three sides. Note the placement
of the fictitious nodes. (Legend: values known from the boundary condition;
unknown values to be approximated by the numerical method; fictitious nodes.)


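Repeating the process at the boundary points $(x_2, 0)$ and $(x_3, 0)$ gives

\[
\begin{aligned}
w_{F2} &= w_{2,1} + 2h\alpha(x_2) \\
w_{F3} &= w_{3,1} + 2h\alpha(x_3),
\end{aligned}
\]

and substituting these last three equations into (1) yields

\[
\begin{aligned}
4w_{1,0} - w_{2,0} - 2w_{1,1} &= -h^2 f_{1,0} + g_{0,0} + 2h\alpha(x_1) \\
-w_{1,0} + 4w_{2,0} - w_{3,0} - 2w_{2,1} &= -h^2 f_{2,0} + 2h\alpha(x_2) \\
-w_{2,0} + 4w_{3,0} - 2w_{3,1} &= -h^2 f_{3,0} + g_{4,0} + 2h\alpha(x_3).
\end{aligned} \tag{2}
\]

Let

\[ D = \begin{bmatrix} 4 & -1 & 0 \\ -1 & 4 & -1 \\ 0 & -1 & 4 \end{bmatrix}, \]

let I denote the 3 x 3 identity matrix, and introduce the notation

\[ \mathbf{w}_{Rj} = [\, w_{1,j}\ \ w_{2,j}\ \ w_{3,j} \,]^T \]

for the unknowns along the jth row of the computational grid. Equation (2) can
then be written in the form

\[
D\mathbf{w}_{R0} - 2I\mathbf{w}_{R1} =
\begin{bmatrix}
-h^2 f_{1,0} + g_{0,0} + 2h\alpha(x_1) \\
-h^2 f_{2,0} + 2h\alpha(x_2) \\
-h^2 f_{3,0} + g_{4,0} + 2h\alpha(x_3)
\end{bmatrix}
\equiv \mathbf{b}_{R0}. \tag{3}
\]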
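Working the template across the remaining three rows of the grid, and taking into
account values known from the Dirichlet boundary conditions where appropriate,
generates the sets of equations:

\[
-I\mathbf{w}_{R0} + D\mathbf{w}_{R1} - I\mathbf{w}_{R2} =
\begin{bmatrix}
-h^2 f_{1,1} + g_{0,1} \\ -h^2 f_{2,1} \\ -h^2 f_{3,1} + g_{4,1}
\end{bmatrix}
\equiv \mathbf{b}_{R1} \tag{4}
\]

\[
-I\mathbf{w}_{R1} + D\mathbf{w}_{R2} - I\mathbf{w}_{R3} =
\begin{bmatrix}
-h^2 f_{1,2} + g_{0,2} \\ -h^2 f_{2,2} \\ -h^2 f_{3,2} + g_{4,2}
\end{bmatrix}
\equiv \mathbf{b}_{R2} \tag{5}
\]

\[
-I\mathbf{w}_{R2} + D\mathbf{w}_{R3} =
\begin{bmatrix}
-h^2 f_{1,3} + g_{0,3} + g_{1,4} \\ -h^2 f_{2,3} + g_{2,4} \\ -h^2 f_{3,3} + g_{4,3} + g_{3,4}
\end{bmatrix}
\equiv \mathbf{b}_{R3}. \tag{6}
\]

Combining (3), (4), (5) and (6) into a single system yields

\[
\begin{bmatrix}
D & -2I & & \\
-I & D & -I & \\
 & -I & D & -I \\
 & & -I & D
\end{bmatrix}
\begin{bmatrix} \mathbf{w}_{R0} \\ \mathbf{w}_{R1} \\ \mathbf{w}_{R2} \\ \mathbf{w}_{R3} \end{bmatrix}
=
\begin{bmatrix} \mathbf{b}_{R0} \\ \mathbf{b}_{R1} \\ \mathbf{b}_{R2} \\ \mathbf{b}_{R3} \end{bmatrix}. \tag{7}
\]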


Thus, the Neumann condition along the bottom of the rectangle has resulted in the
coefficient matrix having an extra block across and down and the first superdiagonal
block being -2I rather than -I. Note that the coefficient matrix in (7) is irreducible
and diagonally dominant with at least one row being strictly diagonally dominant;
hence (7) is guaranteed to have a unique solution.


A Worked Example 


EXAMPLE 9.2 Handling a Neumann Boundary Condition

Consider the partial differential equation

\[ \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} = -[2 + \pi^2 x(1 - x)]\cos(\pi y) \]

on the domain $R = \{(x,y) \mid 0 < x < 1,\ 0 < y < 1/2\}$. The boundary conditions
are given by

\[ u(0,y) = u(1,y) = 0, \qquad \frac{\partial u}{\partial y}(x,0) = 0, \qquad u(x,1/2) = 0. \]

Note that this is the same problem that we solved in Section 9.1, but with the
Dirichlet condition along the bottom of the domain replaced by a Neumann
condition.
[Figure: computational grid for N = 4, M = 2 with a row of fictitious nodes
below the bottom edge. Legend: values known from the boundary condition;
unknown values to be approximated by the numerical method; fictitious nodes.]

The first grid we will use for this problem is shown above. We have taken
N = 4 and M = 2. The mesh spacing is h = 1/4. Note there are six unknowns,
organized by rows into two groups of three.
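Comparing the present problem with the model problem, we find

\[ f(x,y) = -[2 + \pi^2 x(1 - x)]\cos(\pi y), \qquad g(x,y) = 0 \quad\text{and}\quad \alpha(x) = 0. \]

The blocks which compose the right-hand-side vector are therefore

\[
\mathbf{b}_{R0} =
\begin{bmatrix}
-h^2 f_{1,0} + g_{0,0} + 2h\alpha(x_1) \\
-h^2 f_{2,0} + 2h\alpha(x_2) \\
-h^2 f_{3,0} + g_{4,0} + 2h\alpha(x_3)
\end{bmatrix}
=
\begin{bmatrix}
\tfrac{1}{16}[2 + \pi^2(0.25)(0.75)] + 0 + 0 \\
\tfrac{1}{16}[2 + \pi^2(0.5)(0.5)] + 0 \\
\tfrac{1}{16}[2 + \pi^2(0.75)(0.25)] + 0 + 0
\end{bmatrix}
\]

and

\[
\mathbf{b}_{R1} =
\begin{bmatrix}
-h^2 f_{1,1} + g_{0,1} + g_{1,2} \\
-h^2 f_{2,1} + g_{2,2} \\
-h^2 f_{3,1} + g_{4,1} + g_{3,2}
\end{bmatrix}
=
\begin{bmatrix}
\tfrac{1}{16}[2 + \pi^2(0.25)(0.75)]\cos(\pi/4) + 0 + 0 \\
\tfrac{1}{16}[2 + \pi^2(0.5)(0.5)]\cos(\pi/4) + 0 \\
\tfrac{1}{16}[2 + \pi^2(0.75)(0.25)]\cos(\pi/4) + 0 + 0
\end{bmatrix},
\]

so the complete system of equations we arrive at for this grid is

\[
\left[\begin{array}{rrr|rrr}
 4 & -1 &  0 & -2 &  0 &  0 \\
-1 &  4 & -1 &  0 & -2 &  0 \\
 0 & -1 &  4 &  0 &  0 & -2 \\ \hline
-1 &  0 &  0 &  4 & -1 &  0 \\
 0 & -1 &  0 & -1 &  4 & -1 \\
 0 &  0 & -1 &  0 & -1 &  4
\end{array}\right]
\begin{bmatrix} w_{1,0} \\ w_{2,0} \\ w_{3,0} \\ w_{1,1} \\ w_{2,1} \\ w_{3,1} \end{bmatrix}
=
\begin{bmatrix} 0.240659 \\ 0.279213 \\ 0.240659 \\ 0.170172 \\ 0.197433 \\ 0.170172 \end{bmatrix}.
\]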


The entries in the right-hand-side vector have been displayed to six decimal places 
for convenience. 

Here is a table listing the solution to the above system and comparing this
approximate solution to the exact solution, $u(x,y) = x(1 - x)\cos(\pi y)$. The
errors are roughly two and a half times larger than the errors introduced when
Dirichlet conditions were specified around the entire boundary, but the accuracy
is still quite good considering the coarseness of the grid.

x_j     y_k     w_{j,k}     u_{j,k}     Absolute Error
0.25    0.00    0.192371    0.187500    0.004871
0.50    0.00    0.256771    0.250000    0.006771
0.75    0.00    0.192371    0.187500    0.004871
0.25    0.25    0.136027    0.132583    0.003444
0.50    0.25    0.181564    0.176777    0.004787
0.75    0.25    0.136027    0.132583    0.003444
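The effect of the fictitious nodes on the matrix can be seen by assembling the six equations of this example in code. The sketch below builds the 2 x 2 block matrix with D on the diagonal, -2I above it and -I below it, then solves with a generic elimination helper; `solve` and the loop structure are illustrative choices rather than part of the text.

```python
import math


def solve(A, b):
    """Gaussian elimination with partial pivoting; A and b are not modified."""
    n = len(b)
    T = [list(row) + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(T[r][c]))
        T[c], T[p] = T[p], T[c]
        for r in range(c + 1, n):
            m = T[r][c] / T[c][c]
            for q in range(c, n + 1):
                T[r][q] -= m * T[c][q]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(T[r][q] * x[q] for q in range(r + 1, n))
        x[r] = (T[r][n] - s) / T[r][r]
    return x


def f(x, y):
    return -(2 + math.pi ** 2 * x * (1 - x)) * math.cos(math.pi * y)


def exact(x, y):
    return x * (1 - x) * math.cos(math.pi * y)


h = 0.25
xs = [0.25, 0.50, 0.75]
D = [[4.0, -1.0, 0.0], [-1.0, 4.0, -1.0], [0.0, -1.0, 4.0]]

A = [[0.0] * 6 for _ in range(6)]
for i in range(3):
    for j in range(3):
        A[i][j] = D[i][j]            # D block for the row k = 0
        A[3 + i][3 + j] = D[i][j]    # D block for the row k = 1
    A[i][3 + i] = -2.0               # -2I: fictitious node folded into w_{j,1}
    A[3 + i][i] = -1.0               # -I: ordinary coupling to the row below

# All boundary data and alpha are zero here, so only -h^2 f remains.
b = [-h * h * f(x, 0.0) for x in xs] + [-h * h * f(x, 0.25) for x in xs]
w = solve(A, b)

points = [(x, 0.0) for x in xs] + [(x, 0.25) for x in xs]
for (x, y), wj in zip(points, w):
    print(f"{x:.2f}  {y:.2f}  {wj:.6f}  {exact(x, y):.6f}  {abs(wj - exact(x, y)):.6f}")
```

Note that the matrix is no longer symmetric: the -2I block above the diagonal is paired with -I below it.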


To demonstrate that the numerical method retains its second-order rate of
convergence, the test problem was solved using successively finer grids. For each
grid, the rms error in the approximate solution and the ratio of the rms error to
$h^2$ were computed. Examining these latter values, we find strong evidence that
the numerical scheme is $O(h^2)$.

Step Size, h    rms Error          rms Error / h^2
1/4             4.829513 x 10^-3   0.07727
1/6             1.889341 x 10^-3   0.06802
1/8             9.983822 x 10^-4   0.06390
1/10            6.156164 x 10^-4   0.06156
1/12            4.170598 x 10^-4   0.06006
1/14            3.010464 x 10^-4   0.05901
1/16            2.274558 x 10^-4   0.05823
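Another way to read such a table is to estimate the observed order of convergence from consecutive rows: if the error behaves like $Ch^p$, then $p \approx \ln(e_1/e_2)/\ln(h_1/h_2)$. The short sketch below applies this estimate to the rms errors listed above; the variable names are illustrative.

```python
import math

# (step size, rms error) pairs copied from the table for Example 9.2
table = [
    (1 / 4, 4.829513e-3),
    (1 / 6, 1.889341e-3),
    (1 / 8, 9.983822e-4),
    (1 / 10, 6.156164e-4),
    (1 / 12, 4.170598e-4),
    (1 / 14, 3.010464e-4),
    (1 / 16, 2.274558e-4),
]

orders = []
for (h1, e1), (h2, e2) in zip(table, table[1:]):
    # slope of log(error) against log(h) between consecutive grids
    p = math.log(e1 / e2) / math.log(h1 / h2)
    orders.append(p)
    print(f"h: {h1:.4f} -> {h2:.4f}   observed order = {p:.3f}")
```

The estimates sit somewhat above 2 on the coarse grids and drift toward 2 as h decreases, which matches the slow decline of the rms Error / h^2 column.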


Neumann Condition along the Right Side and Robin Condition along the Top 
Edge 


As a second case, suppose that the Neumann boundary condition,

\[ \frac{\partial u}{\partial n} = \alpha(y), \]

is specified along the right side of the domain and the Robin boundary condition,

\[ p(x)\frac{\partial u}{\partial n} + q(x)u = r(x), \]

is specified along the top edge. Dirichlet boundary conditions are provided along
the other two sides of the rectangle. To examine the effect of these boundary
conditions on the finite difference equations, we will once again consider a 5 x 5
grid. Values along the left side and the bottom edge of the grid are given by the
Dirichlet boundary conditions. With a non-Dirichlet boundary condition along the
top of the grid, there will be four rows of unknowns. The non-Dirichlet condition
along the right side dictates that each of these rows will contain four unknowns.
There will also be a total of eight fictitious nodes needed, four across the top and
four down the right side. The complete computational grid, with fictitious nodes
and lexicographic ordering of the unknowns, is shown in Figure 9.10.

Figure 9.10 Computational grid for Poisson equation on a square domain
with a Neumann boundary condition along the right side, a Robin
boundary condition along the top, and Dirichlet boundary conditions on
the remaining two sides. Note the placement of the fictitious nodes.

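Moving the five-point star across the first row of unknowns produces the set
of equations

\[
\begin{aligned}
4w_{1,1} - w_{2,1} - w_{1,2} &= -h^2 f_{1,1} + g_{1,0} + g_{0,1} \\
-w_{1,1} + 4w_{2,1} - w_{3,1} - w_{2,2} &= -h^2 f_{2,1} + g_{2,0} \\
-w_{2,1} + 4w_{3,1} - w_{4,1} - w_{3,2} &= -h^2 f_{3,1} + g_{3,0} \\
-w_{3,1} + 4w_{4,1} - w_{4,2} - w_{F1} &= -h^2 f_{4,1} + g_{4,0}.
\end{aligned} \tag{8}
\]

The value $w_{F1}$ must be eliminated from the last equation in (8). This is
accomplished using the Neumann boundary condition. Along the right side of the
domain, the outward normal vector points in the positive x direction; hence,

\[ \frac{\partial u}{\partial n} = \alpha(y) \quad\text{is equivalent to}\quad \frac{\partial u}{\partial x} = \alpha(y). \]

Using the second-order central difference formula for $\partial u/\partial x$, it
follows that

\[ w_{F1} = w_{3,1} + 2h\alpha(y_1). \]

With this expression for $w_{F1}$, (8) can be written in the form

\[
D'\mathbf{w}_{R1} - I\mathbf{w}_{R2} =
\begin{bmatrix}
-h^2 f_{1,1} + g_{1,0} + g_{0,1} \\
-h^2 f_{2,1} + g_{2,0} \\
-h^2 f_{3,1} + g_{3,0} \\
-h^2 f_{4,1} + g_{4,0} + 2h\alpha(y_1)
\end{bmatrix}
\equiv \mathbf{b}_{R1}, \tag{9}
\]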
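where

\[
D' = \begin{bmatrix}
 4 & -1 &  0 &  0 \\
-1 &  4 & -1 &  0 \\
 0 & -1 &  4 & -1 \\
 0 &  0 & -2 &  4
\end{bmatrix}.
\]

Proceeding in a similar manner, the sets of equations derived from the second,
third, and fourth rows of unknowns are

\[
-I\mathbf{w}_{R1} + D'\mathbf{w}_{R2} - I\mathbf{w}_{R3} =
\begin{bmatrix}
-h^2 f_{1,2} + g_{0,2} \\ -h^2 f_{2,2} \\ -h^2 f_{3,2} \\ -h^2 f_{4,2} + 2h\alpha(y_2)
\end{bmatrix}
\equiv \mathbf{b}_{R2}, \tag{10}
\]

\[
-I\mathbf{w}_{R2} + D'\mathbf{w}_{R3} - I\mathbf{w}_{R4} =
\begin{bmatrix}
-h^2 f_{1,3} + g_{0,3} \\ -h^2 f_{2,3} \\ -h^2 f_{3,3} \\ -h^2 f_{4,3} + 2h\alpha(y_3)
\end{bmatrix}
\equiv \mathbf{b}_{R3}, \tag{11}
\]

and

\[
-I\mathbf{w}_{R3} + D'\mathbf{w}_{R4} - I\mathbf{w}_{RF} =
\begin{bmatrix}
-h^2 f_{1,4} + g_{0,4} \\ -h^2 f_{2,4} \\ -h^2 f_{3,4} \\ -h^2 f_{4,4} + 2h\alpha(y_4)
\end{bmatrix}. \tag{12}
\]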

In equation (12), $\mathbf{w}_{RF}$ is a column vector that contains the values at
the fictitious nodes along the top of the grid. To eliminate this vector, we invoke
the Robin boundary condition, which, along the top edge of the rectangle, is
equivalent to

\[ p(x)\frac{\partial u}{\partial y} + q(x)u = r(x). \]

After evaluating this equation at $(x_1, y_4)$, replacing the partial derivative by
its second-order central difference approximation, replacing $u_{1,4}$ by $w_{1,4}$
and solving for $w_{F5}$, we find

\[ w_{F5} = w_{1,3} - \frac{2hq(x_1)}{p(x_1)}\,w_{1,4} + \frac{2hr(x_1)}{p(x_1)}. \]

Similar expressions are obtained for $w_{F6}$, $w_{F7}$ and $w_{F8}$. Substituting
these expressions into (12), the final set of finite difference equations becomes

\[
-2I\mathbf{w}_{R3} + D''\mathbf{w}_{R4} =
\begin{bmatrix}
-h^2 f_{1,4} + g_{0,4} + 2hr(x_1)/p(x_1) \\
-h^2 f_{2,4} + 2hr(x_2)/p(x_2) \\
-h^2 f_{3,4} + 2hr(x_3)/p(x_3) \\
-h^2 f_{4,4} + 2h\alpha(y_4) + 2hr(x_4)/p(x_4)
\end{bmatrix}
\equiv \mathbf{b}_{R4}, \tag{13}
\]
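where

\[
D'' = \begin{bmatrix}
4 + \dfrac{2hq(x_1)}{p(x_1)} & -1 & 0 & 0 \\
-1 & 4 + \dfrac{2hq(x_2)}{p(x_2)} & -1 & 0 \\
0 & -1 & 4 + \dfrac{2hq(x_3)}{p(x_3)} & -1 \\
0 & 0 & -2 & 4 + \dfrac{2hq(x_4)}{p(x_4)}
\end{bmatrix}.
\]

Combining (9), (10), (11), and (13), the entire system of finite difference equations
takes the form

\[
\begin{bmatrix}
D' & -I & & \\
-I & D' & -I & \\
 & -I & D' & -I \\
 & & -2I & D''
\end{bmatrix}
\begin{bmatrix} \mathbf{w}_{R1} \\ \mathbf{w}_{R2} \\ \mathbf{w}_{R3} \\ \mathbf{w}_{R4} \end{bmatrix}
=
\begin{bmatrix} \mathbf{b}_{R1} \\ \mathbf{b}_{R2} \\ \mathbf{b}_{R3} \\ \mathbf{b}_{R4} \end{bmatrix}. \tag{14}
\]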

Another Worked Example (An Application) 


EXAMPLE 9.3 Temperature Distribution in a Column Supporting an 
Industrial Furnace 


A long column of fireclay brick, which has a square cross-section with side length
of one meter, supports a large industrial furnace. During steady-state operation,
three faces of the column are maintained at 500 K, while the fourth face is exposed
to convective heat transfer with heat transfer coefficient h = 10 W/m^2·K and
ambient temperature $T_\infty$ = 300 K. The thermal conductivity of fireclay brick is
k = 1 W/m·K.

Let’s orient the square in the first quadrant, with one corner at the origin and
the top edge as the convective surface. Taking into account the symmetry of the
problem about the vertical line x = 1/2, the temperature in the column, T(x, y),
satisfies the Poisson problem

\[ \frac{\partial^2 T}{\partial x^2} + \frac{\partial^2 T}{\partial y^2} = 0, \qquad 0 < x < 1/2,\ 0 < y < 1 \]

\[ T(0,y) = T(x,0) = 500, \qquad \frac{\partial T}{\partial x}\left(\tfrac{1}{2}, y\right) = 0, \qquad \frac{\partial T}{\partial y}(x,1) + 10\,T(x,1) = 3000. \]

Consider the 5 x 3 grid shown in Figure 9.11. Note that the mesh spacing is
h = 1/4. Since the grid has four rows of unknowns, the system of finite difference
equations will be organized into four sets of equations. Each set will contain two
equations since there are two unknowns per row.
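Figure 9.11 5 x 3 grid for Example 9.3.


Comparing the current problem with our model problem, we see that

\[ f(x,y) = 0, \quad g(x,y) = 500, \quad \alpha(y) = 0, \quad p(x) = 1, \quad q(x) = 10, \quad r(x) = 3000. \]

Thus, the matrix $D''$ will take the form

\[
D'' = \begin{bmatrix}
4 + 2hq(x_1)/p(x_1) & -1 \\
-2 & 4 + 2hq(x_2)/p(x_2)
\end{bmatrix}
= \begin{bmatrix}
4 + 2(0.25)(10)/1 & -1 \\
-2 & 4 + 2(0.25)(10)/1
\end{bmatrix},
\]

while the components of the right-hand side vector will be

\[
\mathbf{b}_{R1} = \begin{bmatrix} -h^2 f_{1,1} + g_{1,0} + g_{0,1} \\ -h^2 f_{2,1} + g_{2,0} + 2h\alpha(y_1) \end{bmatrix}
= \begin{bmatrix} 0 + 500 + 500 \\ 0 + 500 + 0 \end{bmatrix}
\]

\[
\mathbf{b}_{R2} = \begin{bmatrix} -h^2 f_{1,2} + g_{0,2} \\ -h^2 f_{2,2} + 2h\alpha(y_2) \end{bmatrix}
= \begin{bmatrix} 0 + 500 \\ 0 + 0 \end{bmatrix}
\]

\[
\mathbf{b}_{R3} = \begin{bmatrix} -h^2 f_{1,3} + g_{0,3} \\ -h^2 f_{2,3} + 2h\alpha(y_3) \end{bmatrix}
= \begin{bmatrix} 0 + 500 \\ 0 + 0 \end{bmatrix}
\]

\[
\mathbf{b}_{R4} = \begin{bmatrix} -h^2 f_{1,4} + g_{0,4} + 2hr(x_1)/p(x_1) \\ -h^2 f_{2,4} + 2h\alpha(y_4) + 2hr(x_2)/p(x_2) \end{bmatrix}
= \begin{bmatrix} 0 + 500 + 2(0.25)(3000)/1 \\ 0 + 0 + 2(0.25)(3000)/1 \end{bmatrix}.
\]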
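The complete system of finite difference equations is then

\[
\left[\begin{array}{rr|rr|rr|rr}
 4 & -1 & -1 &  0 &  0 &  0 &  0 &  0 \\
-2 &  4 &  0 & -1 &  0 &  0 &  0 &  0 \\ \hline
-1 &  0 &  4 & -1 & -1 &  0 &  0 &  0 \\
 0 & -1 & -2 &  4 &  0 & -1 &  0 &  0 \\ \hline
 0 &  0 & -1 &  0 &  4 & -1 & -1 &  0 \\
 0 &  0 &  0 & -1 & -2 &  4 &  0 & -1 \\ \hline
 0 &  0 &  0 &  0 & -2 &  0 &  9 & -1 \\
 0 &  0 &  0 &  0 &  0 & -2 & -2 &  9
\end{array}\right]
\begin{bmatrix} w_{1,1} \\ w_{2,1} \\ w_{1,2} \\ w_{2,2} \\ w_{1,3} \\ w_{2,3} \\ w_{1,4} \\ w_{2,4} \end{bmatrix}
=
\begin{bmatrix} 1000 \\ 500 \\ 500 \\ 0 \\ 500 \\ 0 \\ 2000 \\ 1500 \end{bmatrix}.
\]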


Solving this system produces the approximate temperature values listed in
Figure 9.12. Remember that this solution gives the temperature distribution in just
the left half of the brick column. To obtain the distribution on the other half of
the square, we simply need to reflect the values on the left half across the axis of
symmetry at x = 1/2.

Figure 9.13 displays a contour plot of the temperature distribution within the
brick column. A 17 x 9 grid was used both to improve the accuracy of the
approximate solution and to increase the resolution of the contours. The approximate
solution was computed for the left half of the square, and the output values were
then reflected across x = 1/2 to create the full plot.
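The 8 x 8 system of this example can be assembled block by block, following the pattern of equation (14) with two unknowns per row. The sketch below is illustrative: the block names `Dp` and `Dpp` stand for $D'$ and $D''$, `solve` is a generic elimination helper written for this example, and h denotes the mesh spacing, not the heat transfer coefficient.

```python
def solve(A, b):
    """Gaussian elimination with partial pivoting; A and b are not modified."""
    n = len(b)
    T = [list(row) + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(T[r][c]))
        T[c], T[piv] = T[piv], T[c]
        for r in range(c + 1, n):
            m = T[r][c] / T[c][c]
            for s in range(c, n + 1):
                T[r][s] -= m * T[c][s]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        acc = sum(T[r][s] * x[s] for s in range(r + 1, n))
        x[r] = (T[r][n] - acc) / T[r][r]
    return x


h = 0.25                       # mesh spacing
p, q, r = 1.0, 10.0, 3000.0    # Robin data along the top edge
g = 500.0                      # Dirichlet data on the left side and bottom edge

Dp = [[4.0, -1.0], [-2.0, 4.0]]                                # D' (Neumann on right)
Dpp = [[4.0 + 2 * h * q / p, -1.0], [-2.0, 4.0 + 2 * h * q / p]]  # D'' (Robin on top)

rows, per_row = 4, 2
n = rows * per_row
A = [[0.0] * n for _ in range(n)]
for k in range(rows):
    diag = Dpp if k == rows - 1 else Dp
    for i in range(per_row):
        for j in range(per_row):
            A[2 * k + i][2 * k + j] = diag[i][j]
        if k > 0:
            # coupling to the row below; -2I in the last block row (fictitious nodes)
            A[2 * k + i][2 * (k - 1) + i] = -2.0 if k == rows - 1 else -1.0
        if k < rows - 1:
            A[2 * k + i][2 * (k + 1) + i] = -1.0   # coupling to the row above

# f = 0 and alpha = 0, so only boundary data contribute to the right-hand side.
b = [2 * g, g, g, 0.0, g, 0.0, g + 2 * h * r / p, 2 * h * r / p]

T_vals = solve(A, b)
for k in range(rows):
    print(f"row {k + 1}: " + "  ".join(f"{t:7.2f} K" for t in T_vals[2 * k: 2 * k + 2]))
```

The computed values can be compared with the temperatures listed in Figure 9.12.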


EXERCISES 
1. What conditions must the functions p and q satisfy for the coefficient matrix in 
equation (14) to be diagonally dominant? 


In Exercises 2-6, write out the system of finite difference equations corresponding to
the Poisson problem on a rectangular domain with the indicated boundary conditions.
Use a 4 x 4 grid.

2. Dirichlet conditions along the top edge and right side of the domain and Neumann
conditions along the left side and bottom edge. To be specific, suppose

\[
\begin{aligned}
u(x,y) &= g(x,y) && \text{along top edge and right side,} \\
\frac{\partial u}{\partial n} &= \alpha(y) && \text{along left side, and} \\
\frac{\partial u}{\partial n} &= \beta(x) && \text{along bottom edge.}
\end{aligned}
\]

3. Dirichlet boundary conditions along the top and bottom edges, a Neumann
boundary condition along the left side, and a Robin condition along the right
side. To be specific, suppose

\[
\begin{aligned}
u(x,y) &= g(x,y) && \text{along top and bottom edges,} \\
\frac{\partial u}{\partial n} &= \alpha(y) && \text{along left side, and} \\
p(y)\frac{\partial u}{\partial n} + q(y)u &= r(y) && \text{along right side.}
\end{aligned}
\]
T_7 = w_{1,4} = 356.99 K     T_8 = w_{2,4} = 339.05 K
T_5 = w_{1,3} = 436.95 K     T_6 = w_{2,3} = 418.74 K
T_3 = w_{1,2} = 472.07 K     T_4 = w_{2,2} = 462.01 K
T_1 = w_{1,1} = 489.30 K     T_2 = w_{2,1} = 485.15 K
(bottom edge: 500 K)

Figure 9.12 Approximate temperatures in a column supporting an
industrial furnace.


Figure 9.13 Temperature contour lines for the fireclay brick support
column of an industrial furnace. Temperature values are in Kelvin, and
the lower edge and sides are maintained at 500 K. (Horizontal axis: x in meters,
from 0 to 1.)
4. Neumann conditions along the top edge and right side of the domain and Robin
conditions along the left side and bottom edge. To be specific, suppose

\[
\begin{aligned}
\frac{\partial u}{\partial n} &= \alpha(y) && \text{along right side,} \\
\frac{\partial u}{\partial n} &= \beta(x) && \text{along top edge,} \\
p_1(y)\frac{\partial u}{\partial n} + q_1(y)u &= r_1(y) && \text{along left side, and} \\
p_2(x)\frac{\partial u}{\partial n} + q_2(x)u &= r_2(x) && \text{along bottom edge.}
\end{aligned}
\]

5. Dirichlet boundary condition along the bottom edge, Neumann boundary
conditions along the left side and the top edge, and a Robin condition along the
right side. To be specific, suppose

\[
\begin{aligned}
u(x,y) &= g(x,y) && \text{along bottom edge,} \\
\frac{\partial u}{\partial n} &= \alpha(y) && \text{along left side,} \\
\frac{\partial u}{\partial n} &= \beta(x) && \text{along top edge, and} \\
p(y)\frac{\partial u}{\partial n} + q(y)u &= r(y) && \text{along right side.}
\end{aligned}
\]

6. Dirichlet boundary condition along the bottom edge, Neumann boundary
conditions along the left side and the right side, and a Robin condition along the
top edge. To be specific, suppose

\[
\begin{aligned}
u(x,y) &= g(x,y) && \text{along bottom edge,} \\
\frac{\partial u}{\partial n} &= \alpha(y) && \text{along left side,} \\
\frac{\partial u}{\partial n} &= \beta(y) && \text{along right side, and} \\
p(x)\frac{\partial u}{\partial n} + q(x)u &= r(x) && \text{along top edge.}
\end{aligned}
\]


7. Consider the Laplace equation 


ui ou 


a yz = On R= {(z,yl0<e<10<y< 1} 


subject to the boundary conditions 


y 
u(x, 0) = 0, WOU ee ge 

Ou Ay Ou 2 ee 1 
ae = Gy ye ay) + apap Y= Talal 
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Numerically verify that the finite difference method is second-order accurate for 
this problem. The exact solution is 


> y 
io lar paneer 


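8. Repeat Exercise 7 for the Poisson problem

\[ \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} = x^2 + y^2 \quad \text{on } R = \{(x,y) \mid 0 < x < 1,\ 0 < y < 1\} \]

\[ \frac{\partial u}{\partial y}(x,0) = \pi e^{\pi x}, \qquad u(x,1) = \tfrac{1}{2}x^2, \]

\[ u(0,y) = \sin(\pi y), \qquad u(1,y) = e^{\pi}\sin(\pi y) + \tfrac{1}{2}y^2. \]

The exact solution for this problem is $u(x,y) = e^{\pi x}\sin(\pi y) + \tfrac{1}{2}(xy)^2$.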
9. Repeat Exercise 7 for the Poisson problem 


$$\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} = -52\cos(4x + 6y) \quad \text{on } R = \{(x,y) \mid 0 < x < 1,\ 0 < y < 1\}$$


u(z,0) = cos(423, aa 1} + uf{a, 1) = cos(4z + 6) ~ sin(4zx + 6) 


$$\frac{\partial u}{\partial x}(0,y) = -4\sin(6y), \qquad \frac{\partial u}{\partial x}(1,y) = -4\sin(6y + 4).$$


The exact solution for this problem is $u(x,y) = \cos(4x + 6y)$.
10. Repeat Exercise 7 for the Poisson problem 
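$$\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} = 0 \quad \text{on } R = \{(x,y) \mid 1 < x < 2,\ 0 < y < 1\}$$
$$\frac{\partial u}{\partial y}(x,0) = 0, \quad u(x,1) = \ln(x^2 + 1), \quad \frac{\partial u}{\partial x}(1,y) = \frac{2}{y^2 + 1}, \quad u(2,y) = \ln(y^2 + 4).$$

The exact solution for this problem is $u(x,y) = \ln(x^2 + y^2)$.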


11. A long bar has a rectangular cross section, 0.4 meters by 0.6 meters on a side. 
The top and bottom edges of the bar are maintained at a temperature of 200° C, 
while the left side is insulated. The right side is subjected to convective heat 
transfer with a fluid whose temperature is 30°C. The temperature distribution 
within the bar, T(z, y), satisfies the Poisson problem 


$$\frac{\partial^2 T}{\partial x^2} + \frac{\partial^2 T}{\partial y^2} = 0 \quad \text{on } R = \{(x,y) \mid 0 < x < 0.4,\ 0 < y < 0.6\}$$
$$T(x,0) = T(x,0.6) = 200^\circ\text{C}$$
$$\frac{\partial T}{\partial n} = 0 \ \text{ along } x = 0, \qquad k\frac{\partial T}{\partial n} + hT = 30h \ \text{ along } x = 0.4.$$


Approximate T(x, y) using the finite difference method with a uniform mesh
spacing of 0.1 meters. Take k = 1.5 W/m·K for the thermal conductivity of
the bar and h = 50 W/m²·K for the convective heat transfer coefficient.
12. A solid bar having a rectangular cross-section one inch wide and two inches high
is subjected to a constant rate of twist along its axis. The stresses established in
the bar can be related to the nondimensionalized Prandtl stress function, $\phi(x, y)$,
which, for the geometry of this problem, satisfies the Poisson problem


ry oy 
Qn + Age = bon Ra {ey 0<e<1/2,0<y<}} 
™ 10,9) ™ \(x,0) 


Approximate $\phi(x, y)$ using the finite difference method with a uniform mesh
spacing of h = 0.10 inches.


13. The temperature distribution, T(x, y), within a long bar of rectangular
cross-section satisfies the Poisson problem

$$\frac{\partial^2 T}{\partial x^2} + \frac{\partial^2 T}{\partial y^2} = 0 \quad \text{on } R = \{(x,y) \mid 0 < x < 0.4,\ 0 < y < 0.3\}$$
$$T(x,0) = 200^\circ\text{C}$$
$$\frac{\partial T}{\partial n} = 0 \ \text{ along } x = 0 \text{ and } y = 0.3$$
$$k\frac{\partial T}{\partial n} + hT = 30h \ \text{ along } x = 0.4.$$

Approximate T(x, y) using the finite difference method with a uniform mesh
spacing of 0.05 meters. Take k = 1.5 W/m·K for the thermal conductivity of
the bar and h = 50 W/m²·K for the convective heat transfer coefficient.
14. Consider the Poisson problem

$$\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} = f(x,y) \quad \text{for } (x,y) \in R \tag{15}$$
$$\frac{\partial u}{\partial n} = \alpha(x,y) \quad \text{for } (x,y) \in \partial R.$$

Note this problem has Neumann conditions specified around the entire boundary.

(a) Show that in order for (15) to have a solution, the functions f and $\alpha$ must
satisfy the consistency condition

$$\iint_R f(x,y)\,dA = \oint_{\partial R} \alpha(x,y)\,ds.$$

(b) Show that if u(x,y) is a solution to (15) then u(x,y) + k is a solution for
any constant k. Hence, when (15) has a solution, it is not unique.

(c) Construct the system of finite difference equations associated with (15).
Show that the coefficient matrix is singular.

(d) What condition must f and $\alpha$ satisfy in order for the system of finite
difference equations developed in part (c) to have a solution?
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9.3. SOLVING THE DISCRETE EQUATIONS: RELAXATION METHODS 


We have seen that approximating the solution of the Poisson equation using the 
finite difference method requires the solution of a linear system of algebraic equa- 
tions. If the computational grid that is placed over the rectangular domain has 
N subdivisions in one coordinate direction and M subdivisions in the other and 
Dirichlet boundary conditions are specified, then the coefficient matrix for the linear 
system will have dimensions 


(N —1)(M - 1) x (N—1)(M - 1). 


With non-Dirichlet boundary conditions, the matrix is even larger. When N and 
M are small, direct techniques for the solution of linear systems, such as Gaussian 
elimination or special factorization algorithms, are viable choices. It was precisely 
these types of techniques that were used in Sections 9.1 and 9.2.

When N and M are even of moderate size, however, iterative techniques for
the solution of linear systems are often more efficient alternatives than direct tech- 
niques. First, with an iterative technique, there is no need to store the coefficient 
matrix. All that must be known is the structure of the equations; that is, the 
structure of the computational template. Second, even though multiple iterations 
must be performed, an iterative solution typically requires fewer total operations. 
An added bonus is that iterative techniques are generally insensitive to roundoff 
error. 

Any time an iterative method is used to solve the discrete Poisson equations, 
there are actually three solutions floating around of which we must be aware. The 
first of these is the true solution of the partial differential equation, u. The other 
two solutions are related to the linear system of equations: $w^h$ is the true solution
and $\tilde{w}^h$ the approximate solution. $w^h$ and $\tilde{w}^h$ are referred to as grid functions since
they consist of the values of some function at all points of the computational grid.
The superscript, h, is used to identify the mesh spacing of that grid. Among these
three solutions, there are also two different types of errors with which we must be
concerned. Between u and $w^h$, there is the second-order discretization error that
was discussed in the previous sections. Toward the end of this section, the iteration
error between $w^h$ and $\tilde{w}^h$ will be investigated.

In this section, we will consider the three iterative techniques presented in 
Section 3.8, namely, the Jacobi method (simultaneous relaxation), the Gauss-Seidel 
method (successive relaxation) and the SOR method (successive overrelaxation). 
These three methods, together with a host of others, are collectively known as point 
relaxation schemes. Given an approximate solution, $\tilde{w}^h$, to the finite difference
equations, a point relaxation scheme sweeps through the grid updating the values
of $\tilde{w}^h$ one by one to form a new approximate solution $\bar{w}^h$. Throughout this and
the next section, tildes (~) will be used to denote before sweep values and overbars
to denote after sweep values.
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Dirichlet Boundary Conditions 


Let’s start by considering the Poisson problem with Dirichlet conditions specified 
around the entire boundary. In this case, the generic finite difference equation 


$$-w_{j-1,k} - w_{j+1,k} - w_{j,k-1} - w_{j,k+1} + 4w_{j,k} = -h^2 f_{j,k} \tag{1}$$


applies at every grid point. From Exercise 1 of Section 9.1, we know that the
coefficient matrix for the complete system of finite difference equations is symmetric
positive definite. Consequently, the Jacobi method, the Gauss-Seidel method and
the SOR method (with $0 < \omega < 2$) are guaranteed to converge for any choice of
initial approximation $\tilde{w}^h$.

Applying the Jacobi method to equation (1), we find that the update equation 
for the unknown at the (j,k) location of the grid is given by

$$\bar{w}_{j,k} = \frac{1}{4}\left(\tilde{w}_{j-1,k} + \tilde{w}_{j+1,k} + \tilde{w}_{j,k-1} + \tilde{w}_{j,k+1} - h^2 f_{j,k}\right). \tag{2}$$


Before we can write out the update equation for the Gauss-Seidel method, we have 
to specify the precise order in which the unknowns will be processed. If we work 
in lexicographic order, then by the time the sweep reaches location (j,k) of the 
grid, the unknowns at locations (j, k - 1) and (j - 1, k) have already been updated.
Thus, the Gauss-Seidel equation for $\bar{w}_{j,k}$ takes the form

$$\bar{w}_{j,k} = \frac{1}{4}\left(\bar{w}_{j-1,k} + \tilde{w}_{j+1,k} + \bar{w}_{j,k-1} + \tilde{w}_{j,k+1} - h^2 f_{j,k}\right). \tag{3}$$


Finally, the update equation for the SOR method is given by

$$\bar{w}_{j,k} = (1 - \omega)\tilde{w}_{j,k} + \frac{\omega}{4}\left(\bar{w}_{j-1,k} + \tilde{w}_{j+1,k} + \bar{w}_{j,k-1} + \tilde{w}_{j,k+1} - h^2 f_{j,k}\right), \tag{4}$$

where $\omega$ is the relaxation parameter. For the Poisson problem on a rectangular
domain with Dirichlet boundary conditions, the optimal value for the relaxation
parameter is given by

$$\omega_{opt} = \frac{4}{2 + \sqrt{4 - [\cos(\pi/N) + \cos(\pi/M)]^2}} \tag{5}$$

(see Ortega [1]).


Examining equations (2), (3), and (4), we see that the Jacobi method and the
Gauss-Seidel method use exactly the same number of arithmetic operations per grid
point, while the SOR method requires two additional operations per point. In terms
of storage, with the Jacobi method, the value of $\tilde{w}_{j,k}$ will be needed to update other
values in the grid, so $\tilde{w}_{j,k}$ cannot be overwritten by $\bar{w}_{j,k}$. We will therefore have to
maintain two storage structures: one for $\tilde{w}^h$ and one for $\bar{w}^h$. In contrast, for both
the Gauss-Seidel method and the SOR method, $\tilde{w}_{j,k}$ can be overwritten by $\bar{w}_{j,k}$;
hence, only one storage structure need be maintained. As a final remark, we note
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that when using the Jacobi method on a vector or parallel machine, all entries 
in $\bar{w}^h$ can be computed simultaneously. With lexicographic ordering, neither the
Gauss-Seidel method nor the SOR method is vectorizable; however, with red-black 
ordering of the unknowns (see Exercise 13 of Section 9.1), all of the red unknowns 
can be computed simultaneously followed by all of the black unknowns. 


EXAMPLE 9.4 Point Relaxation Schemes in Action


Consider the sample Poisson problem 


$$\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} = 0$$


over the rectangular domain $R = \{(x,y) \mid 1 < x < 2,\ 0 < y < 1\}$ and subject to the
Dirichlet boundary conditions

$$u(x,0) = 2\ln x, \qquad u(x,1) = \ln(x^2 + 1)$$
$$u(1,y) = \ln(y^2 + 1), \qquad u(2,y) = \ln(y^2 + 4).$$


The exact solution to this problem is $u(x,y) = \ln(x^2 + y^2)$.

For various values of N, an N x N computational grid is placed over the
domain. For each grid and each relaxation scheme, $\tilde{w}^h$ is initialized to zero at all
interior grid points, and sweeps are terminated when

$$\max_{1 \le j \le N-1,\ 1 \le k \le N-1} \left|\bar{w}_{j,k} - \tilde{w}_{j,k}\right|$$

falls below the tolerance $5 \times 10^{-11}$. The optimal value of the SOR method relaxation
parameter, as given by equation (5), is used in each case. The number of sweeps
needed to achieve convergence is tabulated below.


N Jacobi Gauss-Seidel SOR (w = wopt) 
5 107 BYé 23 
10 422 220 47 
20 1592 828 91 
40 5940 3088 179 


We make two important observations from the data in this table. First, on 
each grid, the Jacobi method requires roughly twice the number of sweeps to achieve 
convergence as does the Gauss-Seidel method, and the SOR method requires signif- 
icantly fewer sweeps than the Gauss-Seidel method. Second, each time the grid is 
refined, the number of sweeps needed to achieve convergence increases for all of the 
relaxation schemes. In particular, each time N is doubled, the number of sweeps 
needed by the Jacobi method and the Gauss-Seidel method increases by roughly 
a factor of four, while the number of sweeps needed by the SOR method roughly 
doubles. We will examine these issues in more detail toward the end of the section. 
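
The experiment in Example 9.4 can be sketched in a few lines of Python. This is a minimal illustration rather than the code used to generate the table: the function name `solve_example_9_4`, the nested-list storage and the sweep order (k outer, j inner) are assumptions made here, so the sweep counts it reports may differ slightly from the tabulated values.

```python
import math


def solve_example_9_4(n, method, tol=5e-11, max_sweeps=100000):
    """Relax equations (2)-(4) for Example 9.4 on an n-by-n grid (f = 0).

    method is "jacobi", "gauss-seidel" or "sor"; returns (w, sweeps),
    where w[j][k] approximates u(1 + j*h, k*h).
    """
    h = 1.0 / n
    exact = lambda x, y: math.log(x * x + y * y)
    # (n+1) x (n+1) storage: boundary entries hold the Dirichlet data,
    # interior entries start at zero.
    w = [[0.0] * (n + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        t = i * h
        w[i][0] = exact(1.0 + t, 0.0)   # bottom edge, u(x,0) = 2 ln x
        w[i][n] = exact(1.0 + t, 1.0)   # top edge
        w[0][i] = exact(1.0, t)         # left side
        w[n][i] = exact(2.0, t)         # right side
    omega = 1.0                         # omega = 1 reduces (4) to (3)
    if method == "sor":
        c = 2.0 * math.cos(math.pi / n)              # equation (5) with N = M = n
        omega = 4.0 / (2.0 + math.sqrt(4.0 - c * c))
    for sweep in range(1, max_sweeps + 1):
        # Jacobi reads only before-sweep values, so it needs a second array;
        # Gauss-Seidel and SOR overwrite w in place.
        old = [row[:] for row in w] if method == "jacobi" else w
        change = 0.0
        for k in range(1, n):
            for j in range(1, n):
                avg = 0.25 * (old[j - 1][k] + old[j + 1][k]
                              + old[j][k - 1] + old[j][k + 1])
                new = (1.0 - omega) * old[j][k] + omega * avg
                change = max(change, abs(new - old[j][k]))
                w[j][k] = new
        if change < tol:
            return w, sweep
    raise RuntimeError("relaxation did not converge")


if __name__ == "__main__":
    for name in ("jacobi", "gauss-seidel", "sor"):
        _, sweeps = solve_example_9_4(5, name)
        print(name, sweeps)
```

Note how the storage remark made above shows up in the code: only the Jacobi branch copies the array before each sweep.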
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Non-Dirichlet Boundary Conditions 


To handle problems with non-Dirichlet boundary conditions, remember that each 
grid point gives rise to its own finite difference equation and corresponding update 
equation. We just have to be careful with our bookkeeping to make sure that the 
correct equation is applied at each grid point. 

To illustrate this process, consider the Poisson problem 


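$$\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} = f(x,y)$$

on the rectangular domain $R = \{(x,y) \mid a < x < b,\ c < y < d\}$ subject to the
boundary conditions

$$u(a,y) = g_1(y), \qquad u(x,c) = g_2(x), \qquad \frac{\partial u}{\partial y}(x,d) = \alpha(x)$$

and

$$p(y)\frac{\partial u}{\partial x}(b,y) + q(y)u(b,y) = r(y).$$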

The computational grid for this problem is displayed in Figure 9.14. Note that 
the grid points have been separated into four distinct types. Grid points of the 
same type give rise to finite difference equations of identical structure; grid points 
of different types give rise to different equations. It is important to realize that 
the grid points along the right edge of the domain and those along the top edge 
have been classified as different types because of the location of the fictitious node 
within the computational template, not because one edge has a Robin condition 
and the other a Neumann condition. 

Since we have identified four types of grid points, there will be four
fundamentally different update equations with which we will have to work. Here, we
will write out the update equations for the Jacobi method only. The equations
for the Gauss-Seidel method and the SOR method will be considered in the
exercises. When the computational template is placed at any Type I grid point, we
obtain equation (1). Accordingly, the update equation at any Type I grid point is
equation (2). Along the right edge of the domain, we have to take into account
the Robin boundary condition. The resulting update equation at the Type II grid
points is

$$\bar{w}_{N,k} = \frac{1}{4}\left[1 + \frac{h\,q(y_k)}{2p(y_k)}\right]^{-1}\left(2\tilde{w}_{N-1,k} + \frac{2h\,r(y_k)}{p(y_k)} + \tilde{w}_{N,k-1} + \tilde{w}_{N,k+1} - h^2 f_{N,k}\right). \tag{6}$$

Taking into account the Neumann condition specified along the top of the domain,
we find

$$\bar{w}_{j,M} = \frac{1}{4}\left(\tilde{w}_{j-1,M} + \tilde{w}_{j+1,M} + 2\tilde{w}_{j,M-1} + 2h\alpha(x_j) - h^2 f_{j,M}\right) \tag{7}$$
[Figure legend: filled circles mark values known from a boundary condition; open circles mark unknown values, to be approximated by the numerical method; a third symbol marks fictitious nodes.]


Figure 9.14 Computational grid for a problem with non-Dirichlet
boundary conditions specified along the top edge and the right side of
the domain.


at the Type III grid points. Finally, 


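$$\bar{w}_{N,M} = \frac{1}{4}\left[1 + \frac{h\,q(y_M)}{2p(y_M)}\right]^{-1}\left(2\tilde{w}_{N-1,M} + \frac{2h\,r(y_M)}{p(y_M)} + 2\tilde{w}_{N,M-1} + 2h\alpha(x_N) - h^2 f_{N,M}\right). \tag{8}$$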


Having determined the appropriate update equations, how should we organize
the calculations to perform each relaxation sweep? Perhaps the best way to proceed
is to take our cue from the grid itself. The first M - 1 rows of unknowns all contain
N - 1 Type I grid points followed by a Type II grid point. The top row of unknowns
then starts with N - 1 Type III grid points and is terminated by a Type IV grid
point. Thus the structure of the grid suggests the algorithm


    for k = 1 to M-1
        for j = 1 to N-1
            calculate $\bar{w}_{j,k}$ using equation (2)
        end
        calculate $\bar{w}_{N,k}$ using equation (6)
    end
    for j = 1 to N-1
        calculate $\bar{w}_{j,M}$ using equation (7)
    end
    calculate $\bar{w}_{N,M}$ using equation (8)


EXAMPLE 9.5 Relaxation and Non-Dirichlet Boundary Conditions 


Consider the Poisson problem 


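$$\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} = 0 \quad \text{on } R = \{(x,y) \mid 0 < x < 1,\ 0 < y < 1\}$$
$$u(x,0) = 0, \qquad u(0,y) = \frac{y}{1 + y^2}$$
$$\frac{\partial u}{\partial y}(x,1) = \frac{x(2 + x)}{(x^2 + 2x + 2)^2}, \qquad \frac{\partial u}{\partial x}(1,y) + \frac{4}{4 + y^2}u(1,y) = 0.$$


The exact solution for this problem is

$$u(x,y) = \frac{y}{(1 + x)^2 + y^2}.$$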

For various values of N, an N x N computational grid is placed over the
domain. For each grid and each relaxation scheme, $\tilde{w}^h$ is initialized to zero at all
grid points not along the left side or the bottom edge, and sweeps are terminated
when

$$\max_{1 \le j \le N,\ 1 \le k \le N} \left|\bar{w}_{j,k} - \tilde{w}_{j,k}\right|$$

falls below the tolerance $5 \times 10^{-11}$. There is no formula, like equation (5), which
gives the optimal value of the SOR method relaxation parameter in terms of the
number of grid subintervals N and M for problems involving non-Dirichlet
boundary conditions. Though there are algorithms for estimating $\omega_{opt}$ as the sweeps are
being carried out (see, for example, Thomas [2] and Hageman and Young [3]), here,
we will simply run the SOR method with

$$\omega = \omega_1 = \frac{2}{1 + \sin\frac{\pi}{N}} \quad \text{and} \quad \omega = \omega_2 = \frac{2}{1 + \sin\frac{\pi}{2N}}.$$


Note that $\omega_1$ is the value obtained from (5) for an N x N grid. For this problem,
the performance of the SOR method with $\omega = \omega_2$ turns out to be nearly optimal.


Sweeps to Achieve Convergence


     N    Jacobi    Gauss-Seidel    SOR ($\omega = \omega_1$)    SOR ($\omega = \omega_2$)
     5       306             157                   89                   37
    10      1146             591                  178                   72
    20      4249            2202                  346                  145
    40     15621            8138                  667                  287
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From the data in the above table, we can make exactly the same observations 
as we made earlier. On each grid, the Jacobi method requires roughly twice the 
number of sweeps to achieve convergence as does the Gauss-Seidel method, and the 
SOR method (even with far from optimal performance) requires significantly fewer 
sweeps than the Gauss-Seidel method. Also, each time N is doubled, the number 
of sweeps needed by the Jacobi method and the Gauss-Seidel method increases by 


roughly a factor of four, while the number of sweeps needed by the SOR method 
roughly doubles. 
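
The sweep algorithm built from equations (2), (6), (7) and (8) can be tried on the problem of Example 9.5 with the following minimal Python sketch. The function name `jacobi_example_9_5` and the nested-list storage are illustrative assumptions, and the boundary update formulas are written in the form obtained by eliminating the fictitious nodes with centered differences; the sweep count it returns need not match the table exactly.

```python
import math


def jacobi_example_9_5(n, tol=5e-11, max_sweeps=200000):
    """Jacobi relaxation for Example 9.5 on an n-by-n grid; returns (w, sweeps)."""
    h = 1.0 / n
    x = [j * h for j in range(n + 1)]
    y = [k * h for k in range(n + 1)]
    # Boundary data for this example (f = 0 throughout).
    alpha = lambda xv: xv * (2.0 + xv) / (xv * xv + 2.0 * xv + 2.0) ** 2  # du/dy on y = 1
    p = lambda yv: 1.0                      # Robin condition on x = 1:
    q = lambda yv: 4.0 / (4.0 + yv * yv)    #   p du/dx + q u = r
    r = lambda yv: 0.0
    w = [[0.0] * (n + 1) for _ in range(n + 1)]   # w[j][k] ~ u(x_j, y_k)
    for k in range(n + 1):
        w[0][k] = y[k] / (1.0 + y[k] ** 2)        # Dirichlet data on x = 0
    # u(x, 0) = 0 is already stored; every other entry starts at zero.
    for sweep in range(1, max_sweeps + 1):
        old = [row[:] for row in w]               # before-sweep values
        for k in range(1, n):
            for j in range(1, n):                 # Type I points, equation (2)
                w[j][k] = 0.25 * (old[j - 1][k] + old[j + 1][k]
                                  + old[j][k - 1] + old[j][k + 1])
            hp = h / p(y[k])                      # Type II point, equation (6)
            w[n][k] = ((2.0 * old[n - 1][k] + 2.0 * hp * r(y[k])
                        + old[n][k - 1] + old[n][k + 1])
                       / (4.0 + 2.0 * hp * q(y[k])))
        for j in range(1, n):                     # Type III points, equation (7)
            w[j][n] = 0.25 * (old[j - 1][n] + old[j + 1][n]
                              + 2.0 * old[j][n - 1] + 2.0 * h * alpha(x[j]))
        hp = h / p(y[n])                          # Type IV point, equation (8)
        w[n][n] = ((2.0 * old[n - 1][n] + 2.0 * hp * r(y[n])
                    + 2.0 * old[n][n - 1] + 2.0 * h * alpha(x[n]))
                   / (4.0 + 2.0 * hp * q(y[n])))
        change = max(abs(w[j][k] - old[j][k])
                     for j in range(1, n + 1) for k in range(1, n + 1))
        if change < tol:
            return w, sweep
    raise RuntimeError("relaxation did not converge")


if __name__ == "__main__":
    w, sweeps = jacobi_example_9_5(5)
    print("sweeps:", sweeps, " w at (1,1):", w[5][5])
```

The loop structure mirrors the pseudocode: M - 1 rows of Type I points each closed by a Type II point, then the top row of Type III points closed by the single Type IV corner.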


Convergence Analysis 


We've observed in two examples that all three of our relaxation schemes converge 
toward the true solution of the finite difference equations more slowly when the mesh 
size decreases! This behavior has an extremely important practical consequence. 
A smaller mesh size corresponds to smaller truncation error. Thus, there is an 
inherent tradeoff between truncation error and the convergence rate of relaxation 
schemes. Let’s see if we can establish a theoretical basis for this phenomenon. 

When assessing the performance of a relaxation scheme, the main issue is the 
amount by which a single sweep through the grid reduces the difference between 
the true solution of the finite difference equations and the approximate solution 
(i.e., the iteration error). This quantity is measured by the asymptotic convergence
rate, $\mu$. If we let

$$\tilde{v}^h = w^h - \tilde{w}^h \quad \text{and} \quad \bar{v}^h = w^h - \bar{w}^h$$

denote the iteration error prior to commencing a sweep and upon completion of
that sweep, respectively, then

$$\mu = \text{asymptotic value of } \frac{\|\bar{v}^h\|}{\|\tilde{v}^h\|}.$$

In this context, "asymptotic value" refers to the value approached by the indicated
ratio as the iteration nears convergence. Any appropriate norm can be applied to
the grid functions $\bar{v}^h$ and $\tilde{v}^h$. The ideal situation is $\mu \ll 1$, which indicates that
each sweep produces a substantial reduction in the iteration error. When $\mu \approx 1$,
little error reduction is generated by each sweep as convergence nears.

Although throughout this section we have not formulated any of the methods
in this way, we know (from Section 3.8) that the Jacobi method, the Gauss-Seidel
method and the SOR method can all be written in the form

$$\bar{w}^h = T\tilde{w}^h + c$$

for some iteration matrix T and some vector c. We also know that each method
converges linearly with an asymptotic error constant equal to $\rho(T)$, the spectral
radius of the iteration matrix. Thus, as the iteration nears convergence,

$$\|\bar{v}^h\| \approx \rho(T)\,\|\tilde{v}^h\|.$$
Given our definition of the asymptotic convergence rate, it follows that $\mu = \rho(T)$.

To carry our analysis further, let's consider the specific case of the Poisson
problem on the unit square (i.e., $R = \{(x,y) \mid 0 < x < 1,\ 0 < y < 1\}$) subject to
Dirichlet boundary conditions. Further, suppose we place an N x N grid over the
domain. For this problem, it can be shown (see Stoer and Bulirsch [4]) that

$$\rho(T_{jac}) = \cos\frac{\pi}{N} \approx 1 - \frac{1}{2}\left(\frac{\pi}{N}\right)^2 \tag{9}$$

and

$$\rho(T_{gs}) = \left(\cos\frac{\pi}{N}\right)^2 \approx 1 - \left(\frac{\pi}{N}\right)^2, \tag{10}$$

where $T_{jac}$ and $T_{gs}$ are the Jacobi and Gauss-Seidel iteration matrices, respectively.
For the SOR method,

$$\omega_{opt} = \frac{2}{1 + \sin\frac{\pi}{N}}.$$

With $\omega = \omega_{opt}$,

$$\rho(T_{sor}) = \omega_{opt} - 1 = \frac{1 - \sin\frac{\pi}{N}}{1 + \sin\frac{\pi}{N}} \approx 1 - \frac{2\pi}{N}, \tag{11}$$

where $T_{sor}$ is the SOR iteration matrix.

So how do we relate this information to the number of sweeps? Asymptotically,
we know that one sweep reduces the norm of the error by the factor $\rho(T)$.
Therefore, if we wish to reduce the error by a factor $\epsilon$, say, we will need to perform
s sweeps, where $[\rho(T)]^s = \epsilon$. Solving for s gives

$$s = \frac{\ln \epsilon}{\ln \rho(T)}. \tag{12}$$

Substituting (9) and (10) into (12), and using $\ln(1 - x) \approx -x$ for small x, we find

$$s_{jac} \approx \frac{\ln \epsilon}{-\frac{1}{2}\left(\frac{\pi}{N}\right)^2} = -\frac{2\ln \epsilon}{\pi^2}N^2$$

and

$$s_{gs} \approx \frac{\ln \epsilon}{-\left(\frac{\pi}{N}\right)^2} = -\frac{\ln \epsilon}{\pi^2}N^2.$$

From these last two expressions we see that we should expect the Jacobi method
to require twice as many sweeps as the Gauss-Seidel method, and that the number
of sweeps needed by both methods is $O(N^2)$. In contrast, upon substituting (11)
into (12), we obtain

$$s_{sor} \approx \frac{\ln \epsilon}{-\frac{2\pi}{N}} = -\frac{\ln \epsilon}{2\pi}N,$$

so that the number of sweeps needed by the SOR method is O(N).
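
The estimates above are easy to tabulate. The short Python sketch below evaluates equation (12) with the spectral radii from (9), (10) and (11), without the small-angle approximations; the function name `predicted_sweeps` and the choice $\epsilon = 10^{-10}$ are illustrative assumptions, and the formula requires N of at least 3 so that the Jacobi spectral radius is positive.

```python
import math


def predicted_sweeps(n, eps):
    """Sweeps s = ln(eps) / ln(rho(T)) from equation (12) for an n-by-n grid."""
    s = math.sin(math.pi / n)
    rho = {
        "jacobi": math.cos(math.pi / n),             # equation (9)
        "gauss-seidel": math.cos(math.pi / n) ** 2,  # equation (10)
        "sor": (1.0 - s) / (1.0 + s),                # equation (11), omega = omega_opt
    }
    return {name: math.log(eps) / math.log(r) for name, r in rho.items()}


if __name__ == "__main__":
    for n in (5, 10, 20, 40):
        est = predicted_sweeps(n, 1e-10)
        print(n, {name: round(value) for name, value in est.items()})
```

Because $\rho(T_{gs}) = \rho(T_{jac})^2$, the Jacobi estimate is exactly twice the Gauss-Seidel estimate on every grid, and doubling N multiplies those two estimates by about four and the SOR estimate by about two.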
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Some Thoughts on the Programming of Relaxation Methods 


When using a relaxation scheme to solve the finite difference equations associated
with a Poisson problem, the code for the routine can be greatly simplified and the
efficiency vastly improved by keeping two simple ideas in mind. First, dimension
the matrix for storing the values of $w_{j,k}$ to be of size (N + 1) x (M + 1), even if
Dirichlet conditions are specified along a portion of the boundary. Place any known
values of the solution into the appropriate locations in the matrix. At the expense
of at most 2(N + M + 1) memory locations, the need to treat grid points that are
adjacent to a Dirichlet boundary condition separately from all of the other interior
grid points is removed.

Second, outside the iteration loop, construct a matrix which contains the value
of the nonhomogeneous term, f(x,y), at each grid point where the solution is not
known from a Dirichlet boundary condition. Then, inside the iteration loop, use a
table lookup, rather than a function evaluation. This approach will save O(NM)
function evaluations per sweep. Additional operations can be saved by multiplying
each $f_{j,k}$ by $h^2$ when the matrix is constructed. How does this save operations?
Take a close look at the update equations (2), (3), (4), (6), (7), and (8). Each
requires the value $h^2 f_{j,k}$, not just $f_{j,k}$. The same will be true of any update equation.
Hence, if the value $h^2 f_{j,k}$ can be looked up, we will reduce the computational cost
of our relaxation scheme by one arithmetic operation per point per sweep.
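
Both ideas can be seen in the following minimal Python sketch for a problem with Dirichlet conditions on the whole boundary. The names `build_tables` and `gauss_seidel_sweep`, and the nested-list storage, are illustrative assumptions rather than code from the text.

```python
def build_tables(n, m, h, x0, y0, f, boundary):
    """Return (w, h2f): the (n+1) x (m+1) solution array with boundary data
    in place, and a table of h^2 * f at the interior grid points."""
    w = [[0.0] * (m + 1) for _ in range(n + 1)]
    h2f = [[0.0] * (m + 1) for _ in range(n + 1)]
    for j in range(n + 1):
        for k in range(m + 1):
            xj, yk = x0 + j * h, y0 + k * h
            if j in (0, n) or k in (0, m):
                w[j][k] = boundary(xj, yk)      # known value stored in place
            else:
                h2f[j][k] = h * h * f(xj, yk)   # evaluated once, before any sweep
    return w, h2f


def gauss_seidel_sweep(w, h2f):
    """One lexicographic Gauss-Seidel sweep, equation (3); returns the largest
    change made to any unknown during the sweep."""
    n, m = len(w) - 1, len(w[0]) - 1
    change = 0.0
    for k in range(1, m):
        for j in range(1, n):
            # No special cases next to the boundary and no call to f:
            # neighbours and h^2 f are plain table lookups.
            new = 0.25 * (w[j - 1][k] + w[j + 1][k]
                          + w[j][k - 1] + w[j][k + 1] - h2f[j][k])
            change = max(change, abs(new - w[j][k]))
            w[j][k] = new
    return change


if __name__ == "__main__":
    # Test problem: u = x^2 + y^2 on the unit square, so f = 4.
    w, h2f = build_tables(4, 4, 0.25, 0.0, 0.0,
                          lambda x, y: 4.0, lambda x, y: x * x + y * y)
    sweeps = 1
    while gauss_seidel_sweep(w, h2f) > 1e-12:
        sweeps += 1
    print("sweeps:", sweeps, " w at centre:", w[2][2])
```

Inside the sweep, every interior point is handled by the same line of code, and f is never called again after the tables are built.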


An Application Problem: Electric Potential with Unusual Boundary Conditions 


Consider the thin metal plate of length L and height H shown in Figure 9.15. The
top edge of the plate is maintained at an electric potential $V_0$ while the bottom
edge is maintained at an electric potential $-V_0$. No charge is allowed to enter or
leave the plate along the other two sides. A uniform magnetic field of intensity $B_0$
acts normal to the plate and into the page. Our objective is to approximate the
steady-state electric potential, V(x,y), throughout the plate.

At steady state, there will be no volume charge density within the plate, so
from Maxwell's equations, we have

$$\frac{\partial E_x}{\partial x} + \frac{\partial E_y}{\partial y} = 0, \tag{13}$$

where $E_x$ and $E_y$ are the x and y components of the electric field. The electric
potential is related to the electric field by

$$E_x = -\frac{\partial V}{\partial x} \quad \text{and} \quad E_y = -\frac{\partial V}{\partial y}. \tag{14}$$

Substituting (14) into (13) yields

$$\frac{\partial^2 V}{\partial x^2} + \frac{\partial^2 V}{\partial y^2} = 0.$$


Along the top and bottom of the plate, we have the Dirichlet boundary conditions
$V(x,H) = V_0$ and $V(x,0) = -V_0$.
[Figure: plate of length L; top edge at electric potential $V_0$, bottom edge at electric potential $-V_0$; magnetic field $B_0$ directed into the page.]


Figure 9.15 Geometry for "Electric Potential with Unusual Boundary
Conditions" application problem.


Moelter et al. [5] have shown that the appropriate boundary condition along the
other two sides of the plate is

$$\frac{\partial V}{\partial x} = \lambda\frac{\partial V}{\partial y},$$

where $\lambda$ is a parameter which depends upon $B_0$.

Given the location of the non-Dirichlet boundary conditions, this problem
will involve three types of grid points: those along the left side of the plate, those
interior to the plate, and those along the right side of the plate. The finite difference
equation for the interior grid points is just the generic equation (1). For the grid
points along the left side and the right side of the plate, the corresponding finite
difference equations are

$$-2w_{1,k} - (1 + \lambda)w_{0,k-1} - (1 - \lambda)w_{0,k+1} + 4w_{0,k} = -h^2 f_{0,k} \tag{15}$$

and

$$-2w_{N-1,k} - (1 - \lambda)w_{N,k-1} - (1 + \lambda)w_{N,k+1} + 4w_{N,k} = -h^2 f_{N,k}, \tag{16}$$

respectively. Let's now take L = 2, H = 1, $V_0 = 1$ and $\lambda = 0.25$, and place a
uniform 20 x 10 grid over the domain. With $\omega = 1.64$, 49 iterations of the SOR
method produce $\|\bar{w}^h - \tilde{w}^h\|_\infty < 5 \times 10^{-8}$. The resulting equipotential curves (curves
along which V is constant) are displayed in Figure 9.16.
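
A minimal Python sketch of this computation is given below. It applies SOR to the generic equation (1) at interior points and to equations (15) and (16) along the two sides, with f = 0. The function name `plate_potential`, the zero initial guess and the sweep order are assumptions made here, the side parameter is called `lam`, and the grid is assumed to have equal spacing in both directions (L/n = H/m); the number of sweeps it reports need not equal the 49 quoted above.

```python
def plate_potential(n=20, m=10, v0=1.0, lam=0.25, omega=1.64,
                    tol=5e-8, max_sweeps=10000):
    """SOR for the plate problem on an n-by-m grid; returns (w, sweeps),
    where w[j][k] approximates V(j*h, k*h)."""
    w = [[0.0] * (m + 1) for _ in range(n + 1)]
    for j in range(n + 1):
        w[j][0] = -v0      # bottom edge held at -V0
        w[j][m] = v0       # top edge held at +V0
    for sweep in range(1, max_sweeps + 1):
        change = 0.0
        for k in range(1, m):
            for j in range(n + 1):
                if j == 0:        # left side, equation (15)
                    gs = 0.25 * (2.0 * w[1][k] + (1.0 + lam) * w[0][k - 1]
                                 + (1.0 - lam) * w[0][k + 1])
                elif j == n:      # right side, equation (16)
                    gs = 0.25 * (2.0 * w[n - 1][k] + (1.0 - lam) * w[n][k - 1]
                                 + (1.0 + lam) * w[n][k + 1])
                else:             # interior point, equation (1)
                    gs = 0.25 * (w[j - 1][k] + w[j + 1][k]
                                 + w[j][k - 1] + w[j][k + 1])
                new = (1.0 - omega) * w[j][k] + omega * gs
                change = max(change, abs(new - w[j][k]))
                w[j][k] = new
        if change < tol:
            return w, sweep
    raise RuntimeError("SOR did not converge")


if __name__ == "__main__":
    w, sweeps = plate_potential()
    print("sweeps:", sweeps)
    print("potential at mid-height, left and right sides:", w[0][5], w[20][5])
```

One check on the result: the problem is unchanged by a half-turn of the plate combined with a sign change, so the computed values should satisfy $w_{j,k} \approx -w_{N-j,M-k}$.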
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EXERCISES 
1. (a) Derive equations (6), (7), and (8). 
(b) Assuming lexicographic ordering of the unknowns, what are the analogues 
of equations (6), (7), and (8) for the Gauss-Seidel method and the SOR 
method? 


2. (a) Derive equations (15) and (16). 
(b) What are the update equations corresponding to equations (15) and (16) 
for the SOR method? 


In Exercises 3-7, the boundary conditions for a Poisson problem over a rectangular 
domain are provided. For each set of boundary conditions 


(a) identify the different types of grid points;
(b) identify the finite difference equation for each type of grid point; and
(c) construct pseudocode to perform one relaxation sweep.


3. Dirichlet conditions along the top edge and right side of the domain and Neumann
conditions along the left side and bottom edge. To be specific, suppose

$$u(x,y) = g(x,y) \quad \text{along top edge and right side,}$$
$$\frac{\partial u}{\partial n} = \alpha(y) \quad \text{along left side, and}$$
$$\frac{\partial u}{\partial n} = \beta(x) \quad \text{along bottom edge.}$$


4. Neumann conditions along the top edge and right side of the domain and Robin
conditions along the left side and bottom edge. To be specific, suppose

$$\frac{\partial u}{\partial n} = \alpha(y) \quad \text{along right side,}$$
$$\frac{\partial u}{\partial n} = \beta(x) \quad \text{along top edge,}$$
$$p_l(y)\frac{\partial u}{\partial n} + q_l(y)u = r_l(y) \quad \text{along left side, and}$$
$$p_b(x)\frac{\partial u}{\partial n} + q_b(x)u = r_b(x) \quad \text{along bottom edge.}$$


5. Dirichlet boundary conditions along the top and bottom edges, a Neumann
boundary condition along the left side, and a Robin condition along the right
side. To be specific, suppose

$$u(x,y) = g(x,y) \quad \text{along top and bottom edges,}$$
$$\frac{\partial u}{\partial n} = \alpha(y) \quad \text{along left side, and}$$
$$p(y)\frac{\partial u}{\partial n} + q(y)u = r(y) \quad \text{along right side.}$$


6. Dirichlet boundary condition along the bottom edge, Neumann boundary
conditions along the left side and the top edge, and a Robin condition along the right
side. To be specific, suppose

$$u(x,y) = g(x,y) \quad \text{along bottom edge,}$$
$$\frac{\partial u}{\partial n} = \alpha(y) \quad \text{along left side,}$$
$$\frac{\partial u}{\partial n} = \beta(x) \quad \text{along top edge, and}$$
$$p(y)\frac{\partial u}{\partial n} + q(y)u = r(y) \quad \text{along right side.}$$


7. Dirichlet boundary condition along the bottom edge, Neumann boundary
conditions along the left side and the right side, and a Robin condition along the top
edge. To be specific, suppose

$$u(x,y) = g(x,y) \quad \text{along bottom edge,}$$
$$\frac{\partial u}{\partial n} = \alpha(y) \quad \text{along left side,}$$
$$\frac{\partial u}{\partial n} = \beta(y) \quad \text{along right side, and}$$
$$p(x)\frac{\partial u}{\partial n} + q(x)u = r(x) \quad \text{along top edge.}$$


In Exercises 8-11, demonstrate that the Jacobi method and the Gauss-Seidel method 
require O(N?) sweeps and the SOR method O(N) sweeps to achieve a given level 
of convergence. Regardless of the type of boundary conditions, use equation (5) to 
compute w for the SOR method. 


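8. $$\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} = 0 \quad \text{on } R = \{(x,y) \mid 0 < x < 1,\ 0 < y < 1\}$$
$$u(x,0) = 0, \quad u(x,1) = \frac{1}{(1 + x)^2 + 1}, \quad u(0,y) = \frac{y}{1 + y^2}, \quad u(1,y) = \frac{y}{4 + y^2}$$

9. $$\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} = x^2 + y^2 \quad \text{on } R = \{(x,y) \mid 0 < x < 1,\ 0 < y < 1\}$$
$$\frac{\partial u}{\partial y}(x,0) = \pi e^{\pi x}, \quad u(x,1) = \tfrac{1}{2}x^2, \quad u(0,y) = \sin(\pi y), \quad u(1,y) = e^{\pi}\sin(\pi y) + \tfrac{1}{2}y^2$$

10. $$\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} = -52\cos(4x + 6y) \quad \text{on } R = \{(x,y) \mid 0 < x < 1,\ 0 < y < 1\}$$
$$u(x,0) = \cos(4x), \quad u(x,1) = \cos(4x + 6), \quad u(0,y) = \cos(6y), \quad u(1,y) = \cos(6y + 4)$$

11. $$\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} = 0 \quad \text{on } R = \{(x,y) \mid 0 < x < 1,\ 0 < y < 1\}$$
$$u(x,0) = 0, \quad u(0,y) = \frac{y}{1 + y^2}, \quad \frac{\partial u}{\partial x} + \frac{4}{4 + y^2}u = 0 \ \text{ for } x = 1,$$
$$\frac{\partial u}{\partial y} = \frac{(1 + x)^2 - 1}{[(1 + x)^2 + 1]^2} \ \text{ for } y = 1$$
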
12. A solid bar having a rectangular cross section one inch wide and two inches high
is subjected to a constant rate of twist along its axis. The stresses established in
the bar can be related to the nondimensionalized Prandtl stress function, $\phi(x, y)$,
which, for the geometry of this problem, satisfies the Poisson problem

$$\frac{\partial^2 \phi}{\partial x^2} + \frac{\partial^2 \phi}{\partial y^2} = -1 \quad \text{on } R$$
$$\phi(x,y) = 0 \quad \text{on } \partial R,$$

where $R = \{(x,y) \mid 0 < x < 1,\ 0 < y < 2\}$. Approximate $\phi(x, y)$ using the finite
difference method with a uniform mesh spacing of h = 0.10 inches. Use the SOR
method to solve the finite difference equations.


13. Rework the "Electric Potential with Unusual Boundary Conditions" application
problem first with $\lambda = 0.05$ and then with $\lambda = 0.5$.
14. In the "Electric Potential with Unusual Boundary Conditions" application
problem, determine the equipotential curves when L = 1, H = 3, $V_0 = 2$ and $\lambda = 0.3$.


In Exercises 15-22, we consider Jacobi and Gauss-Seidel line relaxation schemes. In
a line relaxation scheme, the values of all of the unknowns along either a row or a
column of the grid are updated simultaneously. For more detail on these and other line
relaxation schemes consult Thomas [2].


15. For Jacobi y-line relaxation, we rewrite the finite difference equation (1) as

$$-\bar{w}_{j-1,k} + 4\bar{w}_{j,k} - \bar{w}_{j+1,k} = \tilde{w}_{j,k-1} + \tilde{w}_{j,k+1} - h^2 f_{j,k}$$


and then solve for all of the unknowns along the kth row of the grid simulta- 
neously. This is referred to as y-line relaxation because the kth row of the grid 
corresponds to a fixed value of y. Jacobi x-line relaxation corresponds to solving 
for all of the unknowns along the jth column of the grid at the same time. 

(a) Construct an algorithm to perform Jacobi y-line relaxation. 

(b) Construct an algorithm to perform Jacobi z-line relaxation. 


(c) How many operations per point does Jacobi line relaxation require? How 
does this compare with the number of operations per point for Jacobi point 
relaxation? 


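16. (a) Apply Jacobi x-line relaxation and Jacobi y-line relaxation to the problem

$$\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} = 0 \quad \text{on } R = \{(x,y) \mid 1 < x < 2,\ 0 < y < 1\}$$
$$u(x,0) = 2\ln x, \qquad u(x,1) = \ln(x^2 + 1)$$
$$u(1,y) = \ln(y^2 + 1), \qquad u(2,y) = \ln(y^2 + 4),$$

which was considered in the text. Use N x N grids with N = 5, 10, 20 and 40,
and a convergence tolerance of $5 \times 10^{-11}$.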


(b) How does the number of sweeps needed to achieve convergence vary with 
N? 


(c) How does the number of sweeps needed by line relaxation compare with the 
number of sweeps needed by point relaxation? 


17. Repeat Exercise 16 for the Poisson problem in Exercise 8. 


18. (a) Apply Jacobi x-line relaxation and Jacobi y-line relaxation to the problem

$$\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} = -[2 + \pi^2 x(1 - x)]\cos(\pi y)$$
$$u(x,0) = u(x,10) = x(1 - x)$$
$$u(0,y) = u(1,y) = 0.$$


Take h = i and use a convergence tolerance of 5 x 10-7, Which scheme 
performs better? 
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(b) Apply Jacobi z-line relaxation and Jacobi y-line relaxation to the problem 


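$$\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} = -[2 + \pi^2 x(1 - x)]\cos(\pi y)$$
$$u(x,0) = x(1 - x), \qquad u(x,1) = x(x - 1)$$
$$u(0,y) = 0, \qquad u(10,y) = -90\cos(\pi y).$$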


Take h = 4 and use a convergence tolerance of 5 x 107%. Which scheme 


performs better? 
(c) Comment on the results of parts (a) and (b). 


19. (a) Construct an algorithm to perform Gauss-Seidel y-line relaxation. 
(b) Construct an algorithm to perform Gauss-Seidel x-line relaxation. 
(c) How many operations per point does Gauss-Seidel line relaxation require? 
How does this compare with the number of operations per point for Gauss- 
Seidel point relaxation? 
20. Repeat Exercise 16 using Gauss-Seidel line relaxation. 
21. Repeat Exercise 17 using Gauss-Seidel line relaxation. 
22. Repeat Exercise 18 using Gauss-Seidel line relaxation. 


9.4 LOCAL MODE ANALYSIS OF RELAXATION AND THE MULTIGRID METHOD 


In the previous section, it was first observed numerically and then established
theoretically that the asymptotic convergence rate, $\mu$, for the Jacobi method, the
Gauss-Seidel method and the SOR method increases when the mesh spacing, h,
decreases. This result implies an inherent, and unfortunate, tradeoff between
truncation error and the convergence rate of relaxation schemes. In this section, we will
perform what is known as a local mode analysis of these relaxation schemes. In a
local mode analysis, we decompose the iteration error into components which have
different frequencies and then quantify the amount by which a single relaxation
sweep reduces each of the components. Our objective is to obtain a better
understanding of the dependence of $\mu$ upon h. Once this has been done, an algorithm
for improving the convergence of relaxation schemes will be developed.

Before we begin our discussions, it should be noted that the material in this 
section is due to Achi Brandt, who first presented the ideas in the paper “Multilevel 
Adaptive Solutions to Boundary Value Problems” [1]. 


Error Evolution Equation 


The first step in performing a local mode analysis is to derive the error evolution 
equation for the particular relaxation scheme. To be specific, let’s consider Gauss- 
Seidel relaxation with lexicographic ordering of the unknowns. With this choice, 
the equation 


$$\frac{\bar w_{j-1,k} + \tilde w_{j+1,k} + \bar w_{j,k-1} + \tilde w_{j,k+1} - 4\bar w_{j,k}}{h^2} = f_{j,k}$$

defines the after sweep value $\bar w_{j,k}$, where a tilde marks before sweep values and
a bar marks after sweep values. The corresponding finite difference equation is

$$\frac{w_{j-1,k} + w_{j+1,k} + w_{j,k-1} + w_{j,k+1} - 4w_{j,k}}{h^2} = f_{j,k}.$$

Figure 9.17 Example of a smooth error component. Open circles denote
before sweep errors, and filled rectangles denote after sweep errors.


Let $\tilde v^h = w^h - \tilde w^h$ and $\bar v^h = w^h - \bar w^h$ denote the before sweep error and
the after sweep error, respectively, in the relaxation solution to the finite difference
equations. Subtracting the Gauss-Seidel equation from the finite difference equation
and rearranging terms yields

$$\bar v_{j,k} = \frac{1}{4}\left(\bar v_{j-1,k} + \tilde v_{j+1,k} + \bar v_{j,k-1} + \tilde v_{j,k+1}\right). \qquad (1)$$


This is the error evolution equation for Gauss-Seidel relaxation with lexicographic 
ordering. It indicates that the new error at location (j,k) of the grid is merely the 
average of the current errors at the horizontally and vertically adjacent grid points. 

To assess the importance of this statement, we will distinguish between two 
types of error components: smooth and nonsmooth. An error component which is 
either slowly oscillating, or non-oscillating, relative to the computational grid, will 
be considered smooth. An example is illustrated in Figure 9.17. The open circles 
indicate the before sweep error, and the filled rectangles indicate the after sweep 
error. Very little reduction in the amplitude of the error has occurred. On the other 
hand, a nonsmooth error component, one that oscillates rapidly relative to the
grid, is illustrated in Figure 9.18. Again, open circles indicate before sweep error,
and filled rectangles indicate after sweep error. Here, there has been a significant 
reduction in the amplitude of the error. 

From these figures we conclude that relaxation reduces the amplitude of 
smooth error components slowly, but reduces the amplitude of nonsmooth error
components more rapidly. Hence, relaxation is efficient at smoothing the error, but 
inefficient at solving the underlying equations. 
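
The contrast between smooth and nonsmooth error components can be checked numerically. The following is a minimal sketch, assuming a square grid with N = 16 subintervals and zero error on the boundary; the function names `gauss_seidel_error_sweep` and `reduction`, and the two sample error grids, are illustrative choices rather than anything taken from the text. It applies one lexicographic Gauss-Seidel sweep of the error evolution equation to a smooth error and to a checkerboard error and reports the ratio of maximum amplitudes.

```python
import numpy as np

def gauss_seidel_error_sweep(v):
    """Apply one lexicographic Gauss-Seidel sweep to an error grid, in place.

    v is an (N+1) x (N+1) array whose boundary entries stay fixed. Each
    interior entry is replaced by the average of its four neighbors; the left
    and lower neighbors have already been updated during the same sweep.
    """
    n = v.shape[0] - 1
    for k in range(1, n):        # rows, bottom to top
        for j in range(1, n):    # points within a row, left to right
            v[j, k] = 0.25 * (v[j - 1, k] + v[j + 1, k]
                              + v[j, k - 1] + v[j, k + 1])
    return v

def reduction(v0):
    """Ratio of max-norm error after one sweep to max-norm error before."""
    v = gauss_seidel_error_sweep(v0.copy())
    return np.max(np.abs(v)) / np.max(np.abs(v0))

N = 16  # assumed number of subintervals in each direction
idx = np.arange(N + 1)
J, K = np.meshgrid(idx, idx, indexing="ij")
interior = (J > 0) & (J < N) & (K > 0) & (K < N)

# Smooth error: the lowest-frequency sine mode on the grid.
smooth = np.sin(np.pi * J / N) * np.sin(np.pi * K / N)
# Nonsmooth error: alternates in sign from one grid point to the next.
rough = np.where(interior, (-1.0) ** (J + K), 0.0)

smooth_ratio = reduction(smooth)
rough_ratio = reduction(rough)
print(f"smooth error amplitude ratio:    {smooth_ratio:.3f}")
print(f"nonsmooth error amplitude ratio: {rough_ratio:.3f}")
```

Under these assumptions the smooth error keeps most of its amplitude after the sweep, while the checkerboard error loses at least half of its amplitude.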


Transformation to Fourier Space 


To turn this qualitative conclusion into a quantitative one, we need to transform
the problem from the physical space into the Fourier, or frequency, space. In
the Fourier space, the dependence of the error on the frequency of oscillation is
explicitly shown. Using frequency as a parameter will also provide a simple means
for classifying error components as smooth or nonsmooth.

Figure 9.18 Example of a nonsmooth error component. Open circles
denote before sweep errors, and filled rectangles denote after sweep
errors.

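Over the rectangular domain $R = \{(x,y) \mid a \le x \le b,\ c \le y \le d\}$, the continuous
Fourier basis functions are given by

$$E_{p,q}(x,y) = \exp\left\{2\pi i\left[p\left(\frac{x-a}{b-a}\right) + q\left(\frac{y-c}{d-c}\right)\right]\right\}.$$

The wavenumbers $p, q = 0, \pm 1, \pm 2, \ldots$ indicate the frequency of oscillation in the
x-direction and the y-direction, respectively. Evaluating this basis function at the
grid point $(x_j, y_k)$, where

$$x_j = a + jh \ \text{ for some } j = 0, 1, 2, \ldots, N \quad\text{and}\quad y_k = c + kh \ \text{ for some } k = 0, 1, 2, \ldots, M,$$

and using $b - a = Nh$ and $d - c = Mh$, yields

$$E_{p,q}(x_j, y_k) = \exp\left\{2\pi i\left(\frac{pj}{N} + \frac{qk}{M}\right)\right\}.$$

If we now define the discrete Fourier mode $\Theta = (\theta_1, \theta_2)$, where

$$\theta_1 = \frac{2\pi p}{N} \quad\text{and}\quad \theta_2 = \frac{2\pi q}{M},$$

then the discrete Fourier basis functions relative to the computational grid are given
by

$$E_\Theta(x_j, y_k) = \exp\left\{i(j\theta_1 + k\theta_2)\right\}.$$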
Since the complex exponential $e^{i\theta}$ is a $2\pi$-periodic function, it follows that the
discrete Fourier basis functions $E_\Theta$, $E_{\Theta \pm (2\pi,0)}$, $E_{\Theta \pm (0,2\pi)}$, and $E_{\Theta \pm (2\pi,2\pi)}$ all
produce identical values at every grid point. This is a phenomenon known as aliasing.

[Figure 9.19: six panels plotting $\cos(0\pi x) = 1$, $\cos(\pi x)$, $\cos(2\pi x)$, $\cos(3\pi x)$,
$\cos(4\pi x)$, and $\cos(5\pi x)$ over [0,1], with the values at the grid points marked.]

Figure 9.19 Demonstration of the aliasing phenomenon over a uniform
partition of the interval [0,1] into four subintervals. Note that, on this
grid, the functions $\cos(3\pi x)$ and $\cos(5\pi x)$ cannot be distinguished.


Every computational grid has a maximum frequency that can be resolved. Any 
function that oscillates more rapidly than this maximum frequency, when
evaluated at the grid points, will be indistinguishable from a grid function with a lower
frequency. For our analysis, we will therefore restrict attention to the discrete
frequency range $-\pi \le \theta_1, \theta_2 \le \pi$.

As a concrete example of aliasing, consider Figure 9.19. Here, we have a uni- 
form partition of the interval [0,1] into four subintervals. The maximum frequency 
that this grid can resolve is 2, a function that alternates between maximum and
minimum at successive grid points. Evaluated at the grid points, the functions
$1 = \cos(0\pi x)$, $\cos(\pi x)$, $\cos(2\pi x)$, $\cos(3\pi x)$, and $\cos(4\pi x)$, which have frequencies
of 0, 1/2, 1, 3/2, and 2, appear as distinct grid functions. However, the function
$\cos(5\pi x)$, which has a frequency of 5/2, is identical to the function $\cos(3\pi x)$ at
every grid point. The grid cannot distinguish these functions.
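
The aliasing in this example is easy to confirm numerically. The sketch below, which assumes NumPy and uses the illustrative names `samples`, `aliased`, and `distinct`, samples $\cos(m\pi x)$ for m = 0, ..., 5 at the five grid points of the partition and compares the resulting grid functions.

```python
import numpy as np

# Grid points of the uniform partition of [0, 1] into four subintervals.
x = np.linspace(0.0, 1.0, 5)

# Grid functions cos(m*pi*x) for m = 0, 1, ..., 5.
samples = {m: np.cos(m * np.pi * x) for m in range(6)}

# cos(5*pi*x) and cos(3*pi*x) agree at every grid point (up to rounding).
aliased = np.allclose(samples[5], samples[3])

# The grid functions for m = 0, ..., 4 are pairwise different.
distinct = all(
    not np.allclose(samples[a], samples[b])
    for a in range(5)
    for b in range(a + 1, 5)
)

print("cos(5 pi x) aliases to cos(3 pi x):", aliased)
print("m = 0..4 give distinct grid functions:", distinct)
```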


Local Mode Analysis


As the name implies, local mode analysis uses only local information. All boundary 
conditions are ignored. Therefore, this type of analysis is useful for local processes, 
like relaxation schemes. The outcomes from a local mode analysis are insight into
the mechanisms at work and quantitatively correct convergence information.

Suppose that the before sweep and after sweep errors can be written as

$$\tilde v^h = A_\Theta E_\Theta \quad\text{and}\quad \bar v^h = \bar A_\Theta E_\Theta,$$

for each discrete Fourier mode $\Theta = (\theta_1, \theta_2)$ with $-\pi \le \theta_1, \theta_2 \le \pi$. The amplitudes,
$A_\Theta$ and $\bar A_\Theta$, are associated with the discrete Fourier mode $\Theta$ and are independent
of the spatial variables. We are interested in how well one sweep of the relaxation
scheme reduces the amplitude of each component of the error, which is measured
by the ratio of $\bar A_\Theta$ to $A_\Theta$.


Definition. The CONVERGENCE FACTOR, $\mu(\Theta)$, is given by

$$\mu(\Theta) = \left|\bar A_\Theta / A_\Theta\right|$$

and measures the reduction in error amplitude produced by a single sweep of
the relaxation scheme as a function of the discrete Fourier mode, $\Theta$.


Based on the analysis associated with Figures 9.17 and 9.18, we expect $\mu(\Theta)$
to be small for high-frequency modes but close to unity for low-frequency modes. 

For Gauss-Seidel relaxation with lexicographic ordering, substitution of the 
discrete Fourier representation of the errors into the error evolution equation, (1), 
yields 


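$$\bar A_\Theta e^{i(j\theta_1 + k\theta_2)} = \frac{1}{4}\left[\bar A_\Theta e^{-i\theta_1} + A_\Theta e^{i\theta_1} + \bar A_\Theta e^{-i\theta_2} + A_\Theta e^{i\theta_2}\right] e^{i(j\theta_1 + k\theta_2)}.$$

Thus, the convergence factor for this scheme is

$$\mu_{GS}(\Theta) = \left|\frac{\bar A_\Theta}{A_\Theta}\right| = \left|\frac{e^{i\theta_1} + e^{i\theta_2}}{4 - e^{-i\theta_1} - e^{-i\theta_2}}\right|.$$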


A contour plot of this function is shown in Figure 9.20. Note that around the 
outside of the plot, where at least one of the frequencies of the Fourier mode is high, 
$\mu_{GS} \approx 0.1$ to $0.4$. Hence, error reduction for these modes is good. Unfortunately,
when both frequencies are low, $\mu_{GS} \approx 1$. In particular, for the smallest frequencies

$$(\theta_1, \theta_2) = \left(\frac{2\pi}{N}, \frac{2\pi}{M}\right) = \left(\frac{2\pi h}{b-a}, \frac{2\pi h}{d-c}\right),$$

expanding the exponential functions in the formula for $\mu_{GS}(\Theta)$ in Taylor series leads
to the estimate $\mu_{GS}(\Theta) \approx 1 - O(h^2)$.
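
The low-frequency estimate can be checked by evaluating the convergence factor directly. The sketch below assumes the unit square, so that the smallest frequencies are $\theta_1 = \theta_2 = 2\pi h$; the function name `mu_gs` and the list of mesh spacings are illustrative choices. If $1 - \mu_{GS}$ is $O(h^2)$, then halving h should shrink that gap by a factor close to 4.

```python
import numpy as np

def mu_gs(theta1, theta2):
    """Convergence factor of lexicographic Gauss-Seidel for the mode (theta1, theta2)."""
    numerator = np.exp(1j * theta1) + np.exp(1j * theta2)
    denominator = 4.0 - np.exp(-1j * theta1) - np.exp(-1j * theta2)
    return np.abs(numerator / denominator)

# Mesh spacings on the unit square (an assumption made for this sketch).
spacings = [1.0 / 32, 1.0 / 64, 1.0 / 128]

# Gap between the convergence factor and 1 at the smallest frequencies.
gaps = [1.0 - mu_gs(2.0 * np.pi * h, 2.0 * np.pi * h) for h in spacings]

# Ratios of successive gaps; values near 4 are consistent with O(h^2).
ratios = [gaps[i] / gaps[i + 1] for i in range(len(gaps) - 1)]

for h, gap in zip(spacings, gaps):
    print(f"h = 1/{round(1 / h):3d}   1 - mu_GS = {gap:.6f}")
print("ratios of successive gaps:", [f"{r:.3f}" for r in ratios])
```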

In summary, local mode analysis has shown us that Gauss-Seidel relaxation 
with lexicographic ordering does an excellent job reducing the amplitude of the 
high-frequency components of the error. Error reduction is poor only for the
low-frequency components of the error, with the convergence factor $\mu_{GS}(\Theta) \approx 1 - O(h^2)$
for the lowest frequencies. Local mode analysis of the Jacobi method and the SOR
method is left for the exercises.

Figure 9.20 Contour plot of convergence factor for Gauss-Seidel
relaxation with lexicographic ordering.


A Two-Grid Method 


Although relaxation does a poor job reducing the amplitude of the low frequency
error components, the key observation to make is that these smooth error
components do not need a mesh spacing of h to be accurately resolved. They can be
approximated just as well on a coarser mesh, one with a larger mesh spacing, with 
much less work involved. Furthermore, on a coarser mesh, some of the frequencies 
which are low relative to the fine grid will be high relative to the coarse grid. 

This suggests working with two grids, rather than just one. Let the first grid 
have a mesh spacing of h. We will refer to this as the fine grid. The fine grid will be 
used to approximate the solution of the partial differential equation, so the mesh 
size should be selected to provide the desired resolution. After several relaxation 
sweeps on the fine grid, the error, $v^h = w^h - \tilde w^h$, where $\tilde w^h$ is the current
approximation, will be smooth. At this point, we
introduce a second grid, one with a mesh spacing of 2h. This coarse grid will be
used to approximate the smooth error from the fine grid, and this approximation
will, in turn, serve as a correction to the fine grid solution.

Now, if we are going to use the coarse grid to approximate $v^h$, we have to
know the equation that $v^h$ satisfies. Let $L^h$ denote the discrete Poisson operator,
which is defined by the relation

$$\left(L^h u^h\right)_{j,k} = \frac{u_{j-1,k} + u_{j+1,k} + u_{j,k-1} + u_{j,k+1} - 4u_{j,k}}{h^2}$$

for any grid function $u^h$. Applying $L^h$ to the equation $v^h = w^h - \tilde w^h$ yields

$$L^h v^h = L^h w^h - L^h \tilde w^h = f^h - L^h \tilde w^h, \qquad (2)$$

where we have used the fact that $w^h$ is the true solution of the discrete Poisson
equation, so $L^h w^h = f^h$. Defining $r^h = f^h - L^h \tilde w^h$, equation (2) becomes

$$L^h v^h = r^h.$$

The grid function $r^h$ is called the residual.

The basic two-grid solution algorithm consists of four steps. First, relax $n_1$
sweeps on the fine grid to produce an approximate solution $\tilde w^h$. Second, construct
the coarse grid problem to approximate $v^h$, the error associated with $\tilde w^h$. The
analogue of the equation $L^h v^h = r^h$ for a grid that has a mesh spacing of 2h is

$$L^{2h} v^{2h} = I_h^{2h} r^h,$$

where $I_h^{2h}$ represents the transfer of the residual from the fine grid to the coarse
grid. Third, by some means, solve the coarse grid problem for $v^{2h}$. Finally, correct
the fine grid solution by adding $I_{2h}^h v^{2h}$ to $\tilde w^h$, where $I_{2h}^h$ represents the transfer of
the correction from the coarse grid to the fine grid. Repeat these four steps until
convergence is achieved on the fine grid.

In the second step of the algorithm, we need to transfer the residual from the
fine grid to the coarse grid. The easiest scheme for performing this task is known as
injection: evaluate $r^h$ on the fine grid and copy values to the coarse grid at points
that are common to both grids. Another grid transfer operator is needed in the
fourth step. The transfer of the correction from the coarse grid to the fine grid is
generally carried out using bilinear interpolation. Taking into account the uniform
mesh spacing of the grids, the operator $I_{2h}^h$ is given by

$$\left(I_{2h}^h v^{2h}\right)_{j,k} = \begin{cases}
v^{2h}_{j/2,\,k/2}, & j \bmod 2 = k \bmod 2 = 0 \\
0.5\left(v^{2h}_{(j-1)/2,\,k/2} + v^{2h}_{(j+1)/2,\,k/2}\right), & j \bmod 2 = 1,\ k \bmod 2 = 0 \\
0.5\left(v^{2h}_{j/2,\,(k-1)/2} + v^{2h}_{j/2,\,(k+1)/2}\right), & j \bmod 2 = 0,\ k \bmod 2 = 1 \\
0.25\left(v^{2h}_{(j-1)/2,\,(k-1)/2} + v^{2h}_{(j+1)/2,\,(k-1)/2} + v^{2h}_{(j-1)/2,\,(k+1)/2} + v^{2h}_{(j+1)/2,\,(k+1)/2}\right), & j \bmod 2 = k \bmod 2 = 1
\end{cases}$$

where j and k run over the subscripts on the fine grid.
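
Both grid transfer operators reduce to a few array slices. The following is a minimal sketch, assuming grid functions are stored as NumPy arrays that include the boundary points and that the fine grid has an even number of subintervals in each direction; the names `inject` and `interpolate` and the test function `g` are illustrative. Because bilinear interpolation reproduces any function of the form a + bx + cy + dxy exactly, such a function makes a convenient check.

```python
import numpy as np

def inject(fine):
    """Injection: copy fine grid values at the points shared with the coarse grid."""
    return fine[::2, ::2].copy()

def interpolate(coarse):
    """Bilinear interpolation of a coarse grid function onto the fine grid."""
    m, n = coarse.shape
    fine = np.zeros((2 * m - 1, 2 * n - 1))
    # Points common to both grids (j and k both even).
    fine[::2, ::2] = coarse
    # j odd, k even: average the two horizontal coarse neighbors.
    fine[1::2, ::2] = 0.5 * (coarse[:-1, :] + coarse[1:, :])
    # j even, k odd: average the two vertical coarse neighbors.
    fine[::2, 1::2] = 0.5 * (coarse[:, :-1] + coarse[:, 1:])
    # j and k both odd: average the four surrounding coarse values.
    fine[1::2, 1::2] = 0.25 * (coarse[:-1, :-1] + coarse[1:, :-1]
                               + coarse[:-1, 1:] + coarse[1:, 1:])
    return fine

def g(x, y):
    """Illustrative bilinear test function."""
    return 1.0 + 2.0 * x + 3.0 * y + 4.0 * x * y

xc = np.linspace(0.0, 1.0, 5)   # coarse grid, mesh spacing 1/4
xf = np.linspace(0.0, 1.0, 9)   # fine grid, mesh spacing 1/8
coarse_values = g(*np.meshgrid(xc, xc, indexing="ij"))
fine_values = g(*np.meshgrid(xf, xf, indexing="ij"))

interpolation_error = np.max(np.abs(interpolate(coarse_values) - fine_values))
injection_error = np.max(np.abs(inject(fine_values) - coarse_values))
print("max interpolation error:", interpolation_error)
print("max injection error:", injection_error)
```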

What is the overall effect of this two-grid algorithm on the iteration error? To
answer this question we must first recognize that because the mesh spacing on the
coarse grid is double that of the fine grid, the error modes on the fine grid are divided
into two groups: those for which $|\Theta| < \pi/2$ and those for which $\pi/2 \le |\Theta| \le \pi$,
where $|\Theta| = \max(|\theta_1|, |\theta_2|)$ (see Figure 9.21). The former group consists of those
modes which can be resolved on the coarse grid, while the latter group contains
those modes whose frequencies are too high to be resolved on the coarse grid.

The modes with $|\Theta| < \pi/2$ have their amplitudes reduced little by the
relaxation sweeps on the fine grid but are subsequently eliminated when the coarse grid
problem is solved. This, of course, assumes that the grid transfers introduce no
errors. As for the modes with $\pi/2 \le |\Theta| \le \pi$, if we let

$$\bar\mu = \max_{\pi/2 \le |\Theta| \le \pi} \mu(\Theta),$$

then $n_1$ relaxation sweeps on the fine grid reduce the amplitudes of these modes by
at least a factor of $\bar\mu^{n_1}$ per cycle. Hence, the result of the two-grid process is the
reduction of all error mode amplitudes by at least $\bar\mu^{n_1}$ per cycle, independent of h.

Figure 9.21 Designation of Fourier modes as high frequency versus
low frequency.

The quantity $\bar\mu$ is called the smoothing factor and measures the efficiency with
which a relaxation scheme reduces the amplitude of high frequency error modes.
For Gauss-Seidel relaxation with lexicographic ordering, if we were to zoom in on
the center of Figure 9.20, we would find that the 0.5 contour touches the boundary
between low and high frequencies ($|\Theta| = \pi/2$) and is the highest valued contour to
do so. Therefore, $\bar\mu_{GS} = 1/2$.
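
The value of the smoothing factor can also be estimated by brute force, sampling the convergence factor over the high-frequency modes and taking the largest value. The sketch below assumes a 401 by 401 sample of the frequency square, chosen so that the lines $|\theta| = \pi/2$ fall on sample points; the names `mu_gs`, `high`, and `smoothing_factor` are illustrative.

```python
import numpy as np

def mu_gs(theta1, theta2):
    """Convergence factor of lexicographic Gauss-Seidel for the mode (theta1, theta2)."""
    numerator = np.exp(1j * theta1) + np.exp(1j * theta2)
    denominator = 4.0 - np.exp(-1j * theta1) - np.exp(-1j * theta2)
    return np.abs(numerator / denominator)

# Sample the frequency square [-pi, pi] x [-pi, pi].
theta = np.linspace(-np.pi, np.pi, 401)
T1, T2 = np.meshgrid(theta, theta, indexing="ij")
mu = mu_gs(T1, T2)

# High-frequency modes: pi/2 <= max(|theta1|, |theta2|) <= pi.
# The small tolerance keeps the sample points lying on |theta| = pi/2.
high = np.maximum(np.abs(T1), np.abs(T2)) >= np.pi / 2 - 1e-12

smoothing_factor = mu[high].max()
print(f"estimated smoothing factor: {smoothing_factor:.4f}")
```

The sampled maximum is attained on the boundary between low and high frequencies, in agreement with the contour-plot argument.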


Multigrid Method 


There is still one detail of the two-grid method which must be addressed, and that
is the solution of the coarse grid problem in the third step. One choice is to repeat
the fine grid procedure. Perform $n_1$ relaxation sweeps on the coarse grid to reduce
the amplitude of the error components that are high frequency relative to the coarse 
grid. Next, transfer the residual to a still coarser grid, one with a mesh spacing 
of 4h, to approximate the error components that remain low frequency with respect 
to the mesh spacing of 2h. We can then continue to introduce coarser and coarser 
grids, doubling the mesh spacing each time. Note that with each doubling of h, we
reduce the frequency resolution of the grid by a factor of two. Hence, half of the
low frequencies on one grid become high frequencies relative to the next grid in the 
sequence. 

If the problem domain is square and the finest grid has a number of subin- 
tervals that is a power of 2, eventually the coarsening process will produce a grid 
with only two subdivisions. One relaxation sweep over this grid (which has only 
one unknown) is equivalent to solving the finite difference equation on that grid.
Even if the domain is not square, but the number of subdivisions in each direction 
on the finest grid is still a power of 2, we will eventually reach a grid which has 
either one row or one column of unknowns. The system of equations on that grid 
can be easily solved using direct techniques. 

Once the problem on the coarsest grid has been solved, we work backward from 
the coarsest grid to the finest grid. Each step in this process requires interpolating 
the correction from the coarser of the two grids and adding it to the approximation 
on the finer grid. To achieve further reduction of the high-frequency error
amplitudes, we can then perform $n_2$ relaxation sweeps on the finer grid. Figure 9.22
provides a schematic for this multiple grid algorithm. For obvious reasons, this 
approach is known as the multigrid V-cycle. The multigrid method uses repeated 
applications of the V-cycle until convergence has been achieved. 
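
The V-cycle translates naturally into a short recursive routine. The following is a minimal sketch rather than a definitive implementation; it assumes a square domain, a number of subintervals that is a power of 2, Dirichlet data already stored in the boundary entries of `w`, lexicographic Gauss-Seidel relaxation, injection of the residual, and bilinear interpolation of the correction. The function names and the test problem (the Poisson problem with exact solution $\sin(\pi x)\sin(\pi y)$ on the unit square) are illustrative choices.

```python
import numpy as np

def relax(w, f, h, sweeps):
    """Lexicographic Gauss-Seidel sweeps for (neighbor sum - 4w)/h^2 = f, in place."""
    n = w.shape[0] - 1
    for _ in range(sweeps):
        for k in range(1, n):
            for j in range(1, n):
                w[j, k] = 0.25 * (w[j - 1, k] + w[j + 1, k] + w[j, k - 1]
                                  + w[j, k + 1] - h * h * f[j, k])
    return w

def residual(w, f, h):
    """Residual r = f - L w at interior points; zero on the boundary."""
    r = np.zeros_like(w)
    r[1:-1, 1:-1] = f[1:-1, 1:-1] - (w[:-2, 1:-1] + w[2:, 1:-1] + w[1:-1, :-2]
                                     + w[1:-1, 2:] - 4.0 * w[1:-1, 1:-1]) / (h * h)
    return r

def interpolate(c):
    """Bilinear interpolation from a coarse grid to the next finer grid."""
    fine = np.zeros((2 * c.shape[0] - 1, 2 * c.shape[1] - 1))
    fine[::2, ::2] = c
    fine[1::2, ::2] = 0.5 * (c[:-1, :] + c[1:, :])
    fine[::2, 1::2] = 0.5 * (c[:, :-1] + c[:, 1:])
    fine[1::2, 1::2] = 0.25 * (c[:-1, :-1] + c[1:, :-1] + c[:-1, 1:] + c[1:, 1:])
    return fine

def v_cycle(w, f, h, n1=2, n2=1):
    """One multigrid V-cycle; w is updated in place and returned."""
    if w.shape[0] - 1 == 2:
        # Coarsest grid: one unknown, so a single sweep solves the equation.
        return relax(w, f, h, 1)
    relax(w, f, h, n1)                             # sweeps on the way down
    r_coarse = residual(w, f, h)[::2, ::2].copy()  # injection of the residual
    correction = v_cycle(np.zeros_like(r_coarse), r_coarse, 2.0 * h, n1, n2)
    w += interpolate(correction)                   # coarse grid correction
    return relax(w, f, h, n2)                      # sweeps on the way back up

# Illustrative test problem on the unit square with zero boundary values.
N = 16
h = 1.0 / N
x = np.linspace(0.0, 1.0, N + 1)
X, Y = np.meshgrid(x, x, indexing="ij")
exact = np.sin(np.pi * X) * np.sin(np.pi * Y)
f = -2.0 * np.pi**2 * exact
w = np.zeros_like(f)

history = [np.max(np.abs(residual(w, f, h)))]
for _ in range(8):
    w = v_cycle(w, f, h)
    history.append(np.max(np.abs(residual(w, f, h))))

print("residual max-norm after each V-cycle:")
print(["%.2e" % value for value in history])
print("max difference from exact solution: %.2e" % np.max(np.abs(w - exact)))
```

The recursion passes a zero initial guess to each coarser grid because the unknown there is a correction, which vanishes on a Dirichlet boundary.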

The cost of the V-cycle must be assessed in terms of both storage and work 
requirements. Suppose the finest grid contains P points. Since the coarse grids have 
been constructed by doubling the mesh spacing, each successive grid will contain 
one-quarter the number of points as the previous grid. The total number of points 
among all of the grids is then 


$$P + \frac{P}{4} + \frac{P}{16} + \cdots + \frac{P}{4^{n_g-1}} = P\left(1 + \frac{1}{4} + \frac{1}{16} + \cdots + \frac{1}{4^{n_g-1}}\right) < \frac{4}{3}P,$$

where $n_g$ denotes the number of grids. Therefore, storage for all of the grids requires
less than 4/3 times the storage needed for the fine grid problem alone. 

To measure the amount of work used by the multigrid algorithm, we first 
make the following definition. 


Definition. One WORK UNIT is the computational cost of one sweep over
the finest grid. 


Since the cost of one sweep over a grid is proportional to the number of points 
on the grid, the total work required to complete one V-cycle is 


$$n + \frac{n}{4} + \frac{n}{16} + \cdots + \frac{n}{4^{n_g-1}} = n\left(1 + \frac{1}{4} + \frac{1}{16} + \cdots + \frac{1}{4^{n_g-1}}\right) < \frac{4}{3}n$$

work units, where $n = n_1 + n_2$ is the total number of sweeps performed on
each grid. For example, if $n_1 = 2$ and $n_2 = 1$, then $n = 3$, and the total
work per V-cycle is less than 4 work units. Thus, all of the sweeps over all of
the additional grids amount to less than one additional sweep over the finest
grid.

Figure 9.22 Schematic for operation of the multigrid V-cycle.
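
The storage and work estimates are finite geometric sums, so they are simple to tabulate. The sketch below uses the illustrative names `grid_sum`, `storage_factor`, and `work_units_per_v_cycle`, and evaluates both quantities for three, four, and five grids with $n_1 = 2$ and $n_2 = 1$.

```python
def grid_sum(num_grids):
    """1 + 1/4 + 1/16 + ... with one term per grid."""
    return sum(0.25 ** level for level in range(num_grids))

def storage_factor(num_grids):
    """Total storage for all grids, as a multiple of the finest grid storage."""
    return grid_sum(num_grids)

def work_units_per_v_cycle(n1, n2, num_grids):
    """Work for one V-cycle, measured in sweeps over the finest grid."""
    return (n1 + n2) * grid_sum(num_grids)

for num_grids in (3, 4, 5):
    print(num_grids, "grids:",
          "storage factor =", storage_factor(num_grids),
          " work units per V-cycle =", work_units_per_v_cycle(2, 1, num_grids))
```

In every case the storage factor stays below 4/3 and the work per V-cycle stays below 4 work units.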

What do we get in return for these modest increases in storage and work? 
Well, the lowest frequency components of the error on the finest grid are elim- 
inated by solving the problem on the coarsest grid. Every other component of 
the error on the finest grid is high frequency on one of the grids, so the
relaxation sweeps reduce the amplitude of every other error component by at least $\bar\mu^{n}$.
Therefore, all components of the error are reduced by at least $\bar\mu^{n}$ per V-cycle,
independent of h. Thus, even though each V-cycle requires a modest increase in 
the number of work units used, to achieve convergence, we expect to use signifi- 
cantly fewer work units than if we had worked only with relaxation over the finest 
grid.

EXAMPLE 9.6 Multigrid Performance


Consider the Poisson problem

$$\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} = -52\cos(4x + 6y)$$

over the unit square, $R = \{(x,y) \mid 0 \le x \le 1,\ 0 \le y \le 1\}$, with the Dirichlet
boundary conditions

$$u(x,0) = \cos(4x), \qquad u(x,1) = \cos(4x + 6),$$
$$u(0,y) = \cos(6y), \qquad u(1,y) = \cos(6y + 4).$$

The number of sweeps required by the Gauss-Seidel method and the SOR method
for three different mesh spacings is summarized below. A convergence tolerance of
$5 \times 10^{-8}$ was applied to $\|w^h - \tilde w^h\|_\infty$. The optimal value of the relaxation parameter
was used in each case for the SOR method.


                                   h = 1/8    h = 1/16    h = 1/32
Gauss-Seidel                          91         320        1100
SOR with $\omega = \omega_{opt}$      28          52         103


Note that each time the mesh spacing is cut in half, the number of work units 
required by the Gauss-Seidel method increases by roughly a factor of 4. We now 
know that this is due to the fact that $\mu_{GS} \approx 1 - O(h^2)$ for the lowest frequency error
components. On the other hand, the number of work units required by the SOR
method increases by a factor of 2. In Exercise 2, we will see that $\mu_{SOR} \approx 1 - O(h)$
for the lowest frequency error components. 

For the same three values of h and the same convergence tolerance applied
to $\|w^h - \tilde w^h\|_\infty$ after the final sweep on the finest grid, the convergence results
for the multigrid method are summarized below. In each case $n_1 = 2$ relaxation
sweeps were performed when moving from the finest grid to the coarsest grid, and
$n_2 = 1$ sweep was performed when working back to the finest grid. The performance
measures listed in Table 9.1 are defined as

$$\text{convergence per V-cycle} = \frac{\|w^h - \tilde w^h\|_\infty}{\|w^h - \tilde w^h\|_\infty \text{ from previous cycle}},$$

$$\text{convergence per work unit} = \left(\text{convergence per V-cycle}\right)^{1/(\text{work units per V-cycle})},$$

and

$$\text{total number of work units} = \left(\text{number of cycles}\right) \times \left(\text{work units per V-cycle}\right),$$

where

$$\text{work units per V-cycle} = (n_1 + n_2)\left(1 + \frac{1}{4} + \frac{1}{16} + \cdots + \frac{1}{4^{n_g-1}}\right)$$

and $n_g$ denotes the number of grids.

h = 1/8: Three Grids

         $\|w^h - \tilde w^h\|_\infty$          Convergence Factors
Cycle    After Final Sweep        Per V-Cycle    Per Work Unit
  1      1.550 x 10^{-1}
  2      9.158 x 10^{-3}          0.059100       0.487554
  3      3.349 x 10^{-4}          0.036568       0.431591
  4      2.424 x 10^{-5}          0.072391       0.513329
  5      1.036 x 10^{-6}          0.042746       0.449046
  6      6.149 x 10^{-8}          0.059339       0.488054
  7      3.838 x 10^{-9}          0.062407       0.494342

total number of work units used = 27.562500


h = 1/16: Four Grids

         $\|w^h - \tilde w^h\|_\infty$          Convergence Factors
Cycle    After Final Sweep        Per V-Cycle    Per Work Unit
  1      1.533 x 10^{-1}
  2      9.459 x 10^{-3}          0.061689       0.497011
  3      5.090 x 10^{-4}          0.053816       0.480268
  4      3.595 x 10^{-5}          0.070621       0.514167
  5      3.064 x 10^{-6}          0.085225       0.539006
  6      2.497 x 10^{-7}          0.081508       0.533007
  7      1.583 x 10^{-8}          0.063400       0.500435

total number of work units used = 27.890625


h = 1/32: Five Grids

         $\|w^h - \tilde w^h\|_\infty$          Convergence Factors
Cycle    After Final Sweep        Per V-Cycle    Per Work Unit
  1      1.477 x 10^{-1}
  2      1.136 x 10^{-2}          0.076905       0.526280
  3      5.001 x 10^{-4}          0.044012       0.457680
  4      3.420 x 10^{-5}          0.068394       0.511058
  5      2.657 x 10^{-6}          0.077701       0.527638
  6      1.740 x 10^{-7}          0.065483       0.505525
  7      1.573 x 10^{-8}          0.090366       0.547956

total number of work units used = 27.972656


TABLE 9.1: Tables for Example 9.6, "Multigrid Performance".

The important observation to make is that, regardless of h, the convergence
per V-cycle and per work unit remains relatively constant. For less work than SOR 
needs to compute the solution with h = 1/8, the multigrid method computes the 
solution with h = 1/32. 
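
The performance measures defined in this example can be recomputed from the per V-cycle factors reported in Table 9.1. The sketch below uses illustrative function names and takes $n_1 = 2$, $n_2 = 1$ from the example; it reproduces the first per work unit entry for h = 1/8 and for h = 1/32, along with the total work for seven cycles.

```python
def work_units_per_v_cycle(n1, n2, num_grids):
    """(n1 + n2) * (1 + 1/4 + ... + 1/4^(num_grids - 1))."""
    return (n1 + n2) * sum(0.25 ** level for level in range(num_grids))

def convergence_per_work_unit(per_v_cycle, work_units):
    """Convergence factor per V-cycle raised to the power 1 / (work units per V-cycle)."""
    return per_v_cycle ** (1.0 / work_units)

def total_work_units(num_cycles, work_units):
    """Number of cycles times the work units per V-cycle."""
    return num_cycles * work_units

wu_three = work_units_per_v_cycle(2, 1, 3)   # h = 1/8, three grids
wu_five = work_units_per_v_cycle(2, 1, 5)    # h = 1/32, five grids

# Per V-cycle factors taken from cycle 2 of the corresponding tables.
factor_three = convergence_per_work_unit(0.059100, wu_three)
factor_five = convergence_per_work_unit(0.076905, wu_five)

print(f"h = 1/8:  per work unit = {factor_three:.6f}, "
      f"total work = {total_work_units(7, wu_three):.6f}")
print(f"h = 1/32: per work unit = {factor_five:.6f}, "
      f"total work = {total_work_units(7, wu_five):.6f}")
```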


We have barely scratched the surface on the topic of multigrid methods. The 
reader interested in more detail should consult one of the following references. As 
noted at the start of this section, the seminal work on multigrid methods is the 
paper “Multilevel Adaptive Solutions to Boundary Value Problems” by Brandt [1]. 
This paper presents many of the fundamental concepts in a very readable format. 
Fulton, Ciesielski, and Schubert [2] also provide a very readable review of the basic 
multigrid concepts. The books by Briggs, Henson and McCormick [3], Hackbusch 
and Trottenberg [4], and McCormick [5] provide material on various aspects of 
multigrid theory and application.

EXERCISES

1. Recall that the update equation for Jacobi relaxation is given by

$$\frac{\tilde w_{j-1,k} + \tilde w_{j+1,k} + \tilde w_{j,k-1} + \tilde w_{j,k+1} - 4\bar w_{j,k}}{h^2} = f_{j,k},$$

where a tilde marks before sweep values and a bar marks after sweep values.

(a) Determine the error evolution equation for Jacobi relaxation.
(b) Determine the formula for the convergence factor $\mu_{JAC}(\Theta)$.
(c) Show that $\mu_{JAC}(\Theta) \approx 1 - O(h^2)$ for the lowest non-zero frequencies

$$(\theta_1, \theta_2) = \left(\frac{2\pi}{N}, \frac{2\pi}{M}\right) = \left(\frac{2\pi h}{b-a}, \frac{2\pi h}{d-c}\right).$$

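2. The update equation for the SOR method can be written in the form

$$\frac{\omega\bar w_{j-1,k} + \omega\tilde w_{j+1,k} + \omega\bar w_{j,k-1} + \omega\tilde w_{j,k+1} - 4(\omega - 1)\tilde w_{j,k} - 4\bar w_{j,k}}{\omega h^2} = f_{j,k},$$

where a tilde marks before sweep values and a bar marks after sweep values.
Take $\omega = \omega_{opt}$, which, for the case of a uniform mesh over the unit square, is

$$\omega_{opt} = \frac{2}{1 + \sin(\pi h)}.$$

(a) Determine the error evolution equation for the SOR method.
(b) Determine the formula for the convergence factor $\mu_{SOR}(\Theta)$.
(c) Show that $\mu_{SOR}(\Theta) \approx 1 - O(h)$ for the lowest nonzero frequencies

$$(\theta_1, \theta_2) = \left(\frac{2\pi}{N}, \frac{2\pi}{M}\right) = \left(\frac{2\pi h}{b-a}, \frac{2\pi h}{d-c}\right).$$

(d) Estimate the smoothing factor $\bar\mu_{SOR}$.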


In Exercises 3-8, apply the multigrid method to each of the following Poisson problems 
over the unit square. Use mesh spacings of h = 1/8, h = 1/16, and h = 1/32. Take 
$n_1 = 2$ and $n_2 = 1$ and apply a convergence tolerance of $5 \times 10^{-8}$ to $\|w^h - \tilde w^h\|_\infty$.
Compute the convergence factor per V-cycle, the convergence factor per work unit, and 
the total number of work units expended. Compare the total number of work units 
to the number of iterations of the Gauss-Seidel method and the SOR method (using 
$\omega = \omega_{opt}$).


3. $\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} = -2\pi^2 \sin(\pi x)\sin(\pi y)$, exact solution: $u(x,y) = \sin(\pi x)\sin(\pi y)$
$u(x,0) = u(x,1) = u(0,y) = u(1,y) = 0$

4. ou + ee =0, exact solution: u(x, y) = qe 
u(z,0) = 0, u(2,3) = Gy gyrET u(0,y) = ees ule) =p 

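5. $\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} = x^2 + y^2$, exact solution: $u(x,y) = e^{\pi x}\sin(\pi y) + \frac{1}{2}(xy)^2$
$u(x,0) = 0$, $u(x,1) = \frac{1}{2}x^2$, $u(0,y) = \sin(\pi y)$, $u(1,y) = e^{\pi}\sin(\pi y) + \frac{1}{2}y^2$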


6. $\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} = -\left[2 + \pi^2 x(1 - x)\right]\cos(\pi y)$,
exact solution: $u(x,y) = x(1 - x)\cos(\pi y)$
$u(x,0) = x(1 - x)$, $u(x,1) = x(x - 1)$, $u(0,y) = u(1,y) = 0$
* O22 — Oy? 
u(z,0)=u(0,y)=1, ulz,1)=e™*, u(l,y) =e? 
8. $\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} = -\frac{x^2 + y^2}{x^2 y^2}$, exact solution: $u(x,y) = \ln(xy)$
$u(x,1) = \ln x$, $u(x,2) = \ln(2x)$, $u(1,y) = \ln y$, $u(2,y) = \ln(2y)$


= (27 +y")e"™, exact solution: u(x, y) =e” 




[Figure 9.23: two flow chamber diagrams, panels (a) and (b), with openings labeled
as uniform inflow or uniform outflow and speeds of 10 m/s, 20 m/s, and 5 m/s.]

Figure 9.23 (a) Flow chamber for Exercise 9. (b) Flow chamber for
Exercise 10.


9. Consider incompressible, irrotational flow through the chamber shown in
Figure 9.23(a). The chamber is a square, one meter on a side, and the flow openings
all have a width of 0.1 meters. 

(a) Set up the boundary value problem for the stream function for this flow 
using the “Flow Through a Contraction Duct” problem capsule as a guide 
(see page 726). 

(b) Solve the boundary value problem from part (a) and plot the resulting 
streamlines. 


10. Repeat Exercise 9 for incompressible, irrotational flow through the chamber in 
Figure 9.23(b). The chamber is a square with side length of one meter. The 
opening on the left side has a width of 0.2 meters, and the bottom edge of the
opening is at the midpoint of the side. The outflow openings each have a width 
of 0.1 meters. 

11. A thin square plate is placed in a horizontal position and simply supported
along its perimeter. A distributed load q is applied to the upper surface. The
deflection, w(x, y), of the plate from the horizontal satisfies (see Timoshenko and
Woinowsky-Krieger [6])

$$\frac{\partial^4 w}{\partial x^4} + 2\frac{\partial^4 w}{\partial x^2 \partial y^2} + \frac{\partial^4 w}{\partial y^4} = \frac{q}{D}$$

subject to the boundary conditions

$$w = 0 \quad\text{and}\quad \frac{\partial^2 w}{\partial n^2} = 0 \quad\text{along the perimeter.}$$

The constant D is known as the flexural rigidity and is given by

$$D = \frac{E t^3}{12(1 - \sigma^2)},$$

where E is Young's modulus, t is the plate thickness, and $\sigma$ is Poisson's ratio. If
we introduce the variable $u = \frac{\partial^2 w}{\partial x^2} + \frac{\partial^2 w}{\partial y^2}$, then the original fourth-order problem
can be replaced by the two Poisson problems

$$\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} = \frac{q}{D}, \qquad u = 0 \ \text{ along the perimeter}$$

and

$$\frac{\partial^2 w}{\partial x^2} + \frac{\partial^2 w}{\partial y^2} = u, \qquad w = 0 \ \text{ along the perimeter.}$$

Suppose the plate is 60 inches on a side and q = 1 psi (pound per square inch),
$E = 30 \times 10^6$ psi, $\sigma = 0.27$, and t = 0.15 inches. With N = 64 on the finest
grid, approximate w(x, y).


12. Repeat Exercise 11 but replace the uniform distributed load by

$$q(x,y) = \frac{1}{360000}\, x(60 - x)\, y(60 - y).$$
13. Gauss-Seidel relaxation with red-black ordering of the unknowns (recall
Exercise 13 of Section 9.1) can be programmed as follows:

    for pass = 1 to 2
        ks = pass
        for j = 1 to N
            for k = ks to N by 2
                compute w(j, k)
            end
            ks = 3 - ks
        end
    end


This code relaxes the red unknowns on the first pass and then the black unknowns 
on the second pass. 

In the multigrid method, replace Gauss-Seidel relaxation with lexicographic or- 
dering by Gauss-Seidel relaxation with red-black ordering. Also replace injection 
of the residual by half injection, in which the residual is multiplied by one-half 
before being copied to the coarser grid. With these modifications, resolve the 
Poisson problem from Exercise 3 with h = 1/32. What is the convergence factor 
per V-cycle? Use this value to estimate the smoothing factor for Gauss-Seidel 
relaxation with red-black ordering. Recall that in theory,

$$\text{convergence per V-cycle} \approx \bar\mu^{\,n}.$$


9.5 IRREGULAR DOMAINS 


Thus far, we have worked exclusively with problems defined over rectangular do- 
mains. Real problems, unfortunately, often involve irregularly shaped domains. To 
finish out this chapter, we will therefore discuss the solution of the Poisson problem
over nonrectangular domains. No attempt will be made to consider every possible
configuration. Rather, several specific cases will be examined. These few examples, 
however, should be sufficient to allow for the investigation of a wide range of dif- 
ferent domain geometries. The section will conclude with a general treatment of 
circular domains. 


Sloped Boundaries 


EXAMPLE 9.7 Flow through a Contraction Duct 


In the Chapter 9 Overview (see page 726), we showed that the stream function,
$\psi(x,y)$, for incompressible, irrotational flow through the asymmetric contraction
duct in Figure 9.1 satisfies the Laplace equation

$$\frac{\partial^2 \psi}{\partial x^2} + \frac{\partial^2 \psi}{\partial y^2} = 0$$

subject to the boundary conditions

$$\psi(0,y) = 3y, \quad 0 \le y \le 2$$
$$\psi(3,y) = 12(y - 1), \quad 1 \le y \le \tfrac{3}{2}$$
$$\psi(x,y) = 6, \quad \text{along the upper wall}$$
$$\psi(x,y) = 0, \quad \text{along the lower wall.}$$


Because the walls of the duct slope at an angle of 45°, it is possible to select
the mesh spacing so that the grid follows the walls exactly. For instance, any mesh
spacing of the form $h = \frac{1}{2N}$ for some positive integer N will suffice. The grid
corresponding to N = 2 is shown in Figure 9.24.

The major difference between this problem and those which were treated in
earlier sections is that here, the rows of the grid do not contain the same number of
unknowns. In particular, with $h = \frac{1}{2N}$, the computational grid will have $4N - 1$ rows
of unknowns. With the sloped lower boundary, the rows numbered $k = 1$ through
$k = 2N$ each contain $2N + k - 1$ unknowns. The rows numbered $k = 2N + 1$
through $k = 3N - 1$ each have $6N - 1$ unknowns. Finally, with the sloped upper
boundary, the rows numbered $k = 3N$ through $k = 4N - 1$ each have $6N - k - 1$
unknowns.

To approximate $\psi(x,y)$, a mesh spacing of h = 0.1 meters was selected. The
system of finite difference equations was solved using Gauss-Seidel relaxation with 
a. convergence tolerance of 5 x 107’. The streamlines for the flow are plotted in 
Figure 9.25. 


For any angle of inclination other than 45°, the only way for the computational
grid to follow the entire boundary is to allow for a different mesh spacing in the
x-direction than in the y-direction. This, of course, means that we must rederive
the template for the discrete Poisson problem to account for a non-uniform mesh.
To do this, let $\Delta x$ denote the spacing between grid points in the x-direction and $\Delta y$
denote the spacing between grid points in the y-direction. At the arbitrary interior
grid point $(x_j, y_k)$,

$$\frac{\partial^2 u}{\partial x^2} \approx \frac{w_{j-1,k} - 2w_{j,k} + w_{j+1,k}}{(\Delta x)^2}$$

and

$$\frac{\partial^2 u}{\partial y^2} \approx \frac{w_{j,k-1} - 2w_{j,k} + w_{j,k+1}}{(\Delta y)^2},$$

so the finite difference equation takes the form

$$\frac{w_{j-1,k} - 2w_{j,k} + w_{j+1,k}}{(\Delta x)^2} + \frac{w_{j,k-1} - 2w_{j,k} + w_{j,k+1}}{(\Delta y)^2} = f_{j,k}.$$

The truncation error associated with this equation is second-order in both $\Delta x$
and $\Delta y$. Multiplying through by $-(\Delta x)^2$ and defining $\lambda = (\Delta x)^2/(\Delta y)^2$ yields

$$-w_{j-1,k} - w_{j+1,k} - \lambda w_{j,k-1} - \lambda w_{j,k+1} + 2(1 + \lambda)w_{j,k} = -(\Delta x)^2 f_{j,k}. \qquad (1)$$

Figure 9.24 Grid corresponding to N = 2 for Example 9.7.

Figure 9.25 Streamlines for flow in an asymmetric contraction duct
with 45° contraction angles. The values of three of the streamlines are
given, and the increment between streamlines is $\Delta\psi = 0.6$.


We will demonstrate the use of this formula in the next example, where we also 
include a non-Dirichlet boundary condition along the sloped portion of the bound- 
ary. 
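
Solving equation (1) for the center value gives the update used in a Gauss-Seidel sweep on a mesh with different spacings in the two directions. The following is a minimal sketch, assuming grid values are stored in NumPy arrays indexed as `w[j, k]`; the function name `relax_point` and the quadratic test function are illustrative. Because central differences are exact for quadratics, feeding in exact values of $u = x^2 + y^2$ (for which the right-hand side is 4) should return the exact center value.

```python
import numpy as np

def relax_point(w, f, j, k, dx, dy):
    """Solve the five-point template with spacings dx and dy for w[j, k]."""
    lam = (dx / dy) ** 2
    return (w[j - 1, k] + w[j + 1, k]
            + lam * (w[j, k - 1] + w[j, k + 1])
            - dx * dx * f[j, k]) / (2.0 * (1.0 + lam))

# Illustrative check with u = x^2 + y^2, whose Laplacian is 4 everywhere.
dx, dy = 1.5, 1.0
x = dx * np.arange(5)
y = dy * np.arange(5)
X, Y = np.meshgrid(x, y, indexing="ij")
u = X**2 + Y**2
f = 4.0 * np.ones_like(u)

updated = relax_point(u, f, 2, 2, dx, dy)
print("updated value:", updated, " exact value:", u[2, 2])
```

With equal spacings the parameter is 1 and the update reduces to the familiar average of the four neighbors minus the scaled right-hand side.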


EXAMPLE 9.8 Temperature Distribution in a Bar of Trapezoidal Cross 
Section 


A long metallic bar has a trapezoidal cross section. The two parallel sides are 
maintained at constant temperatures, while the remaining sides are insulated. The 
temperature distribution within the bar, T(z, y), satisfies Laplace’s equation 


$$\frac{\partial^2 T}{\partial x^2} + \frac{\partial^2 T}{\partial y^2} = 0.$$

The specific geometry of the domain and the boundary conditions are shown 
in Figure 9.26. Note that the angled top edge of the domain has a rise of 20 mm 
over a run of 30 mm. Hence, the slope is 2/3. For the computational grid to follow 
along this portion of the boundary, we will need to select Ay and Az in the same 
ratio. 

At all interior grid points, the appropriate finite difference equation is the one 
we derived previously: 


$$-w_{j-1,k} - w_{j+1,k} - \lambda w_{j,k-1} - \lambda w_{j,k+1} + 2(1 + \lambda)w_{j,k} = -(\Delta x)^2 f_{j,k},$$

where $\lambda = (\Delta x)^2/(\Delta y)^2$. With a non-Dirichlet boundary condition along the
bottom edge of the domain, a separate equation must be derived for the unknowns
along the first row of the grid. When the computational template, equation (1),
is placed along the first row, each $w_{j,k-1}$ will be a fictitious node. The Neumann
boundary condition along the bottom of the bar implies that, to second-order,
$w_{j,k-1} = w_{j,k+1}$; therefore, the equation for the unknowns along the first row of
the grid is

$$-w_{j-1,k} - w_{j+1,k} - 2\lambda w_{j,k+1} + 2(1 + \lambda)w_{j,k} = -(\Delta x)^2 f_{j,k}.$$

[Figure 9.26: trapezoidal cross section with labeled dimensions of 40 mm, 20 mm,
and 30 mm; the parallel sides are held at T = 0°C and T = 100°C, and
$\partial T/\partial n = 0$ on the top and bottom edges.]

Figure 9.26 Geometry and boundary conditions for Example 9.8.


A separate equation is also needed for the unknowns along the top edge of the domain. Consider the following diagram. [Diagram: the grid point $(x_j, y_k)$ on the sloped edge, with the fictitious node $w_L$ to its left and the fictitious node $w_T$ above it.]

Placing the template at $(x_j, y_k)$ produces the equation

$$-w_L - w_{j+1,k} - \lambda w_{j,k-1} - \lambda w_T + 2(1+\lambda)w_{j,k} = -(\Delta x)^2 f_{j,k},$$

where $w_L$ and $w_T$ are fictitious nodes. Since the slope of the boundary is $\Delta y/\Delta x$ and the outward normal points up and to the left, it follows that

$$\mathbf{n} = -\Delta y\,\mathbf{i} + \Delta x\,\mathbf{j} \qquad\text{and}\qquad \frac{\partial}{\partial n} = -\Delta y\frac{\partial}{\partial x} + \Delta x\frac{\partial}{\partial y}.$$

The normal vector has not been scaled to unit length; this is immaterial because the boundary condition is homogeneous. The boundary condition along the top edge of the domain then yields

$$-\Delta y\,\frac{w_{j+1,k} - w_L}{2\Delta x} + \Delta x\,\frac{w_T - w_{j,k-1}}{2\Delta y} = 0.$$

Multiplying through by $-2(\Delta x/\Delta y)$ and rearranging terms gives

$$-w_L - \lambda w_T = -w_{j+1,k} - \lambda w_{j,k-1}.$$

Therefore, for the unknowns along the upper boundary, the appropriate finite difference equation is

$$-2w_{j+1,k} - 2\lambda w_{j,k-1} + 2(1+\lambda)w_{j,k} = -(\Delta x)^2 f_{j,k}.$$

Figure 9.27 Temperature contours for a metallic bar with trapezoidal cross section. The values of three of the contours are given, and the increment between contours is $\Delta T = 10^\circ$C. [Axes: x, millimeters; y, millimeters.]

A contour plot of the temperature distribution within the bar is shown in Figure 9.27. For this calculation, the spacing between grid points was taken to be $\Delta x = 1.5$ mm and $\Delta y = 1.0$ mm. The system of finite difference equations was
solved using Gauss-Seidel relaxation. A convergence tolerance of 5 x 10~® applied 
to ||w* — 2" ||,9 was used to terminate iteration. 


Curved Boundaries 


Whereas sloped boundaries can generally be handled with some extra bookkeeping, curved boundaries are particularly messy with finite differences. To keep matters as simple as possible, we will only consider the case where Dirichlet conditions are specified along a curved boundary. Near the boundary, there is no way to maintain a uniform grid spacing. In the worst case, the distances from $w_{j,k}$ to its four neighboring grid points would all be different, as indicated in Figure 9.28. To be able to handle this situation, we must determine the form of the five-point star template.

Figure 9.28 Example where distance from four neighboring grid points are all different. [Diagram: the left, right, bottom and top neighbors of $w_{j,k}$ lie at distances $m_L h$, $m_R h$, $m_B h$ and $m_T h$, respectively.]

If we expand $u_{j-1,k}$ and $u_{j+1,k}$ into Taylor series with the appropriate increment in $x$, as indicated in the diagram, this leads to the approximation

$$\frac{\partial^2 u}{\partial x^2} \approx \frac{2}{h^2}\left[\frac{1}{m_L(m_L+m_R)}w_{j-1,k} - \frac{1}{m_L m_R}w_{j,k} + \frac{1}{m_R(m_L+m_R)}w_{j+1,k}\right]. \qquad (2)$$

Proceeding in a similar fashion with $u_{j,k-1}$ and $u_{j,k+1}$, we find

$$\frac{\partial^2 u}{\partial y^2} \approx \frac{2}{h^2}\left[\frac{1}{m_B(m_B+m_T)}w_{j,k-1} - \frac{1}{m_B m_T}w_{j,k} + \frac{1}{m_T(m_B+m_T)}w_{j,k+1}\right]. \qquad (3)$$

Combining these last two expressions, the generic interior grid point computational template for a nonuniform grid is found to be

$$\frac{1}{m_L(m_L+m_R)}w_{j-1,k} + \frac{1}{m_R(m_L+m_R)}w_{j+1,k} + \frac{1}{m_B(m_B+m_T)}w_{j,k-1} + \frac{1}{m_T(m_B+m_T)}w_{j,k+1} - \left(\frac{1}{m_L m_R} + \frac{1}{m_B m_T}\right)w_{j,k} = \frac{h^2}{2}f_{j,k}. \qquad (4)$$

Not only is this formula cumbersome, it is only first order in $h$. Only when $m_L = m_R = m_B = m_T = 1$ does this formula reduce to the standard template and regain second-order accuracy.
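The coefficients of the nonuniform template are easy to get wrong by hand, so a small helper can be useful. The sketch below is illustrative only: the names `irregular_star` and `discrete_laplacian` are assumptions, and the second function simply applies equation (4), divided by $h^2/2$, to a known function to estimate its Laplacian.

```python
def irregular_star(mL, mR, mB, mT):
    """Coefficients (left, right, bottom, top, center) of template (4)."""
    cL = 1.0 / (mL * (mL + mR))
    cR = 1.0 / (mR * (mL + mR))
    cB = 1.0 / (mB * (mB + mT))
    cT = 1.0 / (mT * (mB + mT))
    c0 = -(1.0 / (mL * mR) + 1.0 / (mB * mT))
    return cL, cR, cB, cT, c0

def discrete_laplacian(u, x0, y0, h, mL, mR, mB, mT):
    """Estimate u_xx + u_yy at (x0, y0) from neighbors at distances m*h."""
    cL, cR, cB, cT, c0 = irregular_star(mL, mR, mB, mT)
    total = (cL * u(x0 - mL * h, y0) + cR * u(x0 + mR * h, y0)
             + cB * u(x0, y0 - mB * h) + cT * u(x0, y0 + mT * h)
             + c0 * u(x0, y0))
    return 2.0 * total / h ** 2
```

With all four multipliers equal to one the coefficients are 1/2, 1/2, 1/2, 1/2 and -2, which is the standard five-point star scaled by one half. The estimate is exact for quadratic functions for any choice of multipliers; the loss of accuracy to first order shows up only in the cubic and higher terms of the Taylor expansion.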




EXAMPLE 9.9 Torsion of a Bar with a Curved Boundary 


Suppose that a long bar whose cross section is indicated in Figure 9.29 is subjected to a constant rate of twist along its length. The stresses established in the bar can be related to the nondimensionalized Prandtl stress function, $\psi(x, y)$, which satisfies the Poisson problem

$$\frac{\partial^2 \psi}{\partial x^2} + \frac{\partial^2 \psi}{\partial y^2} = -1,$$

with Dirichlet boundary condition $\psi(x, y) = 0$ along the entire boundary.

Figure 9.29 Cross section of long bar with curved boundary for Example 9.9. [Diagram label: arc of an ellipse.]

Since the smallest dimension of the domain is 1/4, let's choose $h = 1/(4N)$ for some positive integer $N$. The computational grid will then contain $4N - 1$ rows of unknowns. With Dirichlet conditions specified around the entire boundary, every unknown is located at an interior grid point. Therefore, the system of finite difference equations will be constructed from the standard five-point star template and equation (4), only. No other finite difference equations will have to be developed.

Assuming lexicographic ordering of the unknowns, rows numbered $k = 1$ through $k = 2N - 1$ will contain at least $N$ unknowns, though the actual number will vary from row to row and will depend on the x-coordinate of the intersection between the elliptical arc and $y = y_k$. If the origin is located at the bottom left corner of the domain, then the equation of the ellipse is

$$16(x - 1)^2 + 36y^2 = 9,$$

and the intersection between the ellipse and $y = y_k$ occurs at

$$x_{ell} = 1 - \frac{1}{4}\sqrt{9 - 36(kh)^2}.$$

The number of unknowns on row $k$, for $1 \le k \le 2N - 1$, is then

$$N + \left\lfloor \left(\frac{3}{4} - \frac{1}{4}\sqrt{9 - 36(kh)^2}\right)\Big/ h \right\rfloor.$$

Rows numbered $k = 2N$ through $k = 4N - 1$ of the grid all have the full complement of unknowns ($4N - 1$).

Which finite difference equation is used for which unknowns? For $k > 2N$ and all $j$, as well as for $k \le 2N$ and $j < N$, which correspond to grid points that are guaranteed not to interact with the elliptical boundary, the standard five-point star template is applied. For all other unknowns, we apply equation (4). From the geometry of the domain, we see that $m_L$ and $m_T$ are always equal to one, but $m_R$ and $m_B$ may be less than one. In particular, $m_R$ and $m_B$ are given by the formulas

$$m_R = \min\left(1, (x_{ell} - x_j)/h\right) \qquad\text{and}\qquad m_B = \min\left(1, (y_k - y_{ell})/h\right),$$

where

$$y_{ell} = \frac{1}{6}\sqrt{9 - 16(jh - 1)^2}.$$

The approximate Prandtl stress function was computed for $N = 6$; that is, with $h = 1/24$. The contours of $\psi$ are plotted in Figure 9.30. The maximum value of the stress function was 0.041307, which occurred at $x = 5/12$ and $y = 2/3$. The finite difference equations were solved using the SOR method with a relaxation parameter of $\omega = 1.7$. A convergence tolerance of $5 \times 10^{-8}$ was applied to $\|\mathbf{w}^{(n+1)} - \mathbf{w}^{(n)}\|_\infty$ to terminate iteration.

Figure 9.30 Contours of the Prandtl stress function for a bar with an elliptical arc as part of its boundary. The values associated with several contours are given, and the increment between contours is 0.005.
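The bookkeeping for this example reduces to a few geometric helpers. The sketch below is an illustration under stated assumptions: the function names are invented here, the origin is at the bottom left corner of the domain, and the ellipse is $16(x-1)^2 + 36y^2 = 9$ as in the example. `m_bottom` is meaningful only for $jh \ge 1/4$, where the square root is real.

```python
import math

def x_ell(k, h):
    """x-coordinate where the elliptical arc meets the grid line y = k*h."""
    return 1.0 - 0.25 * math.sqrt(9.0 - 36.0 * (k * h) ** 2)

def y_ell(j, h):
    """y-coordinate of the elliptical arc above x = j*h (needs j*h >= 1/4)."""
    return math.sqrt(9.0 - 16.0 * (j * h - 1.0) ** 2) / 6.0

def unknowns_on_row(k, N):
    """Number of unknowns on row k, for 1 <= k <= 2N - 1, with h = 1/(4N)."""
    h = 1.0 / (4 * N)
    root = math.sqrt(9.0 - 36.0 * (k * h) ** 2)
    return N + math.floor((0.75 - 0.25 * root) / h)

def m_right(j, k, h):
    """Fraction of h from (x_j, y_k) to its right neighbor or the arc."""
    return min(1.0, (x_ell(k, h) - j * h) / h)

def m_bottom(j, k, h):
    """Fraction of h from (x_j, y_k) to its bottom neighbor or the arc."""
    return min(1.0, (k * h - y_ell(j, h)) / h)
```

For $N = 6$, for instance, `unknowns_on_row(1, 6)` gives 6 and `unknowns_on_row(11, 6)` gives 16, and the last unknown on row 11 has `m_right` of roughly 0.81, so equation (4) rather than the standard star applies there.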
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Circular Domains 


As the previous three examples have illustrated, applying the finite difference 
method to solve a partial differential equation over a non-rectangular domain, 
especially one with curved boundaries, requires significantly more effort and/or 
bookkeeping than solving an equation defined over a rectangular domain. The one 
exception to this rule is a partial differential equation on a circular domain. For a 
problem with a circular domain, we can use polar coordinates to ensure that the 
grid follows the entire boundary. 

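The first thing that must be done is to convert the Laplacian operator, $L = \partial^2/\partial x^2 + \partial^2/\partial y^2$, into polar coordinates. Recall that the Cartesian coordinates $x$ and $y$ are related to the polar coordinates $r$ and $\theta$ by the equations

$$x = r\cos\theta, \quad y = r\sin\theta \qquad\text{and}\qquad r = \sqrt{x^2 + y^2}, \quad \theta = \arctan(y/x).$$

After some tedious manipulation involving the chain rule, it can be shown that

$$L = \frac{\partial^2}{\partial r^2} + \frac{1}{r}\frac{\partial}{\partial r} + \frac{1}{r^2}\frac{\partial^2}{\partial \theta^2}.$$

Therefore, the Poisson problem in polar coordinates is

$$\frac{\partial^2 u}{\partial r^2} + \frac{1}{r}\frac{\partial u}{\partial r} + \frac{1}{r^2}\frac{\partial^2 u}{\partial \theta^2} = f(r, \theta).$$

To formulate the computational template, suppose the domain is $r_{inner} < r < r_{outer}$ and $\theta_{start} < \theta < \theta_{end}$. Let $\Delta r = (r_{outer} - r_{inner})/N$ and $\Delta\theta = (\theta_{end} - \theta_{start})/M$ for some positive integers $N$ and $M$, and let $r_j = r_{inner} + j\Delta r$ and $\theta_k = \theta_{start} + k\Delta\theta$. Following the standard procedure for developing finite difference equations, we arrive at the generic interior grid point equation

$$\frac{w_{j-1,k} - 2w_{j,k} + w_{j+1,k}}{(\Delta r)^2} + \frac{1}{r_j}\,\frac{w_{j+1,k} - w_{j-1,k}}{2\Delta r} + \frac{1}{r_j^2}\,\frac{w_{j,k-1} - 2w_{j,k} + w_{j,k+1}}{(\Delta\theta)^2} = f_{j,k},$$

which is second order in both $\Delta r$ and $\Delta\theta$. If we multiply through by $(\Delta r)^2$, define $\lambda = (\Delta r)^2/(\Delta\theta)^2$ and group like terms, the template becomes

$$\left(1 - \frac{\Delta r}{2r_j}\right)w_{j-1,k} + \left(1 + \frac{\Delta r}{2r_j}\right)w_{j+1,k} + \frac{\lambda}{r_j^2}w_{j,k-1} + \frac{\lambda}{r_j^2}w_{j,k+1} - 2\left(1 + \frac{\lambda}{r_j^2}\right)w_{j,k} = (\Delta r)^2 f_{j,k}. \qquad (5)$$

This is the polar five-point star.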


EXAMPLE 9.10 Torsion of Quarter Round Molding 


A strip of quarter-round molding is subjected to a constant rate of twist along its length. The nondimensional Prandtl stress function, $\psi(r, \theta)$, satisfies the Poisson problem

$$\frac{\partial^2 \psi}{\partial r^2} + \frac{1}{r}\frac{\partial \psi}{\partial r} + \frac{1}{r^2}\frac{\partial^2 \psi}{\partial \theta^2} = -1, \qquad 0 < r < 1, \quad 0 < \theta < \frac{\pi}{2},$$

$$\psi(0, \theta) = \psi(1, \theta) = 0,$$

$$\psi(r, 0) = \psi(r, \pi/2) = 0.$$

Since this problem has Dirichlet conditions around the entire boundary, every unknown is an interior grid point. The system of finite difference equations then consists solely of equation (5) for $j = 1, 2, 3, \ldots, N-1$ and $k = 1, 2, 3, \ldots, M-1$. Let's take $N = M = 64$ and use the SOR method with $\omega = 1.9$ and a convergence tolerance of $5 \times 10^{-7}$ applied to $\|\mathbf{w}^{(n+1)} - \mathbf{w}^{(n)}\|_\infty$. The value for $\omega$ was selected by trial and error. To simplify programming, we will represent the single point $r = 0$ by the collection of boundary grid points $(0, \theta_k)$ for $k = 0, 1, 2, \ldots, M$, all initialized with the value zero. A contour plot of the computed approximate stress function is shown in Figure 9.31.

Figure 9.31 Contour plot of nondimensional Prandtl stress function for a strip of quarter-round molding.
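The calculation described in this example can be sketched in a few lines. This is not the program used to produce the figure: the function name `sor_quarter_disk` and its arguments are assumptions, and the stopping test is taken to be the largest change in any unknown over one sweep. The update is equation (5) with $f = -1$, solved for $w_{j,k}$ and over-relaxed.

```python
import numpy as np

def sor_quarter_disk(N=64, M=64, omega=1.9, tol=5e-7, max_sweeps=10000):
    """SOR for psi_rr + psi_r/r + psi_tt/r^2 = -1 on 0<r<1, 0<theta<pi/2."""
    dr = 1.0 / N
    dth = (np.pi / 2.0) / M
    lam = (dr / dth) ** 2
    r = dr * np.arange(N + 1)
    f = -1.0
    # Rows j = 0 and j = N and columns k = 0 and k = M hold the zero
    # Dirichlet data; r = 0 is stored as a whole row of zeros.
    w = np.zeros((N + 1, M + 1))
    for sweep in range(1, max_sweeps + 1):
        change = 0.0
        for j in range(1, N):
            a = dr / (2.0 * r[j])
            c = lam / r[j] ** 2
            for k in range(1, M):
                # Gauss-Seidel value from the polar five-point star (5)
                gs = ((1.0 - a) * w[j - 1, k] + (1.0 + a) * w[j + 1, k]
                      + c * (w[j, k - 1] + w[j, k + 1])
                      - dr ** 2 * f) / (2.0 * (1.0 + c))
                new = w[j, k] + omega * (gs - w[j, k])
                change = max(change, abs(new - w[j, k]))
                w[j, k] = new
        if change < tol:
            return r, w, sweep
    raise RuntimeError("SOR did not converge within max_sweeps")
```

Calling `sor_quarter_disk()` uses the grid and relaxation parameter quoted above; a coarser call such as `sor_quarter_disk(N=16, M=16, omega=1.7, tol=1e-10)` runs quickly and should produce a solution that is positive in the interior and symmetric about $\theta = \pi/4$.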
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EXAMPLE 9.11 An Annular Region with a Robin Boundary Condition 


Consider the Poisson problem

$$\frac{\partial^2 T}{\partial r^2} + \frac{1}{r}\frac{\partial T}{\partial r} + \frac{1}{r^2}\frac{\partial^2 T}{\partial \theta^2} = -1,$$

$$T(0.5, \theta) = 20,$$

$$(1 + 0.5\cos 3\theta)\,T(2, \theta) + \frac{\partial T}{\partial r}(2, \theta) = 0.$$

This problem could be used to model the temperature distribution within a hollow, thick-shelled tube with internal heat generation. Fluid flowing through the inner core of the tube maintains an inner surface temperature 20 degrees higher than the ambient temperature surrounding the tube. The convective heat transfer coefficient varies about its mean value around the circumference of the tube.

This problem has two twists. First, note that there are no boundary conditions specified for $\theta$. Periodicity of the solution in $\theta$ is implied. Suppose the interval $[0, 2\pi]$ is divided into $M$ uniformly sized pieces. When the template, equation (5), is applied for $k = 0$, periodicity requires that we use the value $w_{j,M-1}$ for $w_{j,-1}$. Similarly, when the template is applied for $k = M - 1$, we use the value $w_{j,0}$ in place of $w_{j,M}$.

The second twist is the Robin boundary condition specified at the outer radius of the annulus. Suppose the radial interval is divided into $N$ uniformly sized pieces. When the template is applied for $j = N$, the value $w_{N+1,k}$ will be a fictitious node, say $w_F$. The finite difference equation associated with the Robin boundary condition is

$$(1 + 0.5\cos 3\theta_k)\,w_{N,k} + \frac{w_F - w_{N-1,k}}{2\Delta r} = 0.$$

Solving this equation for $w_F$ and substituting the result into (5) yields

$$2w_{N-1,k} + \frac{\lambda}{r_N^2}w_{N,k-1} + \frac{\lambda}{r_N^2}w_{N,k+1} - 2\left[1 + \frac{\lambda}{r_N^2} + \left(\Delta r + \frac{(\Delta r)^2}{2r_N}\right)(1 + 0.5\cos 3\theta_k)\right]w_{N,k} = (\Delta r)^2 f_{N,k}$$

as the finite difference equation for the unknowns along the outer boundary.

To approximate $T(r, \theta)$, let's take $N = 12$ and $M = 32$ and use the SOR method to solve the complete system of finite difference equations. The value $\omega = 1.8$ was selected for the relaxation parameter by trial and error. A convergence tolerance of $5 \times 10^{-7}$ was used to terminate relaxation sweeps. A contour plot of the computed temperature distribution is shown in Figure 9.32.
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EXERCISES 


Figure 9.32 Temperature contours for annular region with a Robin boundary condition along the outer surface. The inner radius corresponds to the contour $T = 20$. Radiating outward, the contours decrement by 2.

1. Derive equations (2) and (3), and show that each is only first-order accurate in the parameter $h$.


2. Approximate the solution of the Poisson problem


au 1du. 1 du cos 6 T 
plied re gcr 
Or? r Or 1? O62 re! et ea ee 
=i : 
u(1,A)=0, wu(3,é)= = cos é, u(r,0) = Z ad u(r, 7/2) = 0. 


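3. Approximate the solution of

$$\frac{\partial^2 u}{\partial r^2} + \frac{1}{r}\frac{\partial u}{\partial r} + \frac{1}{r^2}\frac{\partial^2 u}{\partial \theta^2} = 0, \qquad 1 < r < 2, \quad 0 < \theta < \pi,$$

$$u(1, \theta) = 1 - \cos\theta, \quad u(2, \theta) = u(r, 0) = 0, \quad u(r, \pi) = 4 - 2r.$$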


4. A coaxial cable consists of a 0.1-inch-square inner conductor, maintained at a potential of 0 volts, and a 0.5-inch-square outer conductor, maintained at a
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potential of 110 volts. The potential, ¢(x,y), within the cross section of the 
cable is governed by the partial differential equation 


Given the symmetry of the cable and the potentials on the inner and outer conductors, $\phi$ can be determined by solving the above equation over the non-rectangular domain shown in Figure 9.33. The relevant boundary conditions are indicated in the diagram.

Figure 9.33 Diagram for Exercise 4. [Diagram label: $\phi = 110$ V.]


5. A metallic plate has uniformly spaced V-shaped grooves milled into its upper 
surface. The upper surface, including the grooves, is maintained at a temper- 
ature of 200° C, while the bottom surface is held at 20°C. The temperature 
distribution within the plate satisfies the Laplace equation. Taking into account 
all available symmetries, the domain and the appropriate boundary conditions 
are as indicated in Figure 9.34. 


Figure 9.34 Diagram for Exercise 5. [Diagram labels: $\partial T/\partial n = 0$ on two sides, $T = 20^\circ$C, 0.04 m, 0.08 m.]

6. A uniform flow of velocity 6 meters/second enters a contraction duct that contracts symmetrically along both walls and exits as a uniform flow with a velocity of 15 meters/second. The duct has a one-meter-long entry section with a width of two meters, a one-half-meter contraction section that contracts 0.6 meters on each wall, and a one-meter-long exit section. The stream function, $\psi(x, y)$, satisfies the Laplace equation with boundary conditions

$\psi(x, y) = 0$ along the bottom wall,
$\psi(x, y) = 12$ along the upper wall,
$\psi(0, y) = 6y$, $0 \le y \le 2$, and
$\psi(2.5, y) = 15(y - 0.6)$, $0.6 \le y \le 1.4$.


7. A rod with a cross section in the shape of an equilateral triangle is subjected to a constant rate of twist along its length. The nondimensional Prandtl stress function, $\psi(x, y)$, satisfies the Poisson problem


on the domain as shown in Figure 9.36 subject to the indicated boundary con- 
ditions. 


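8. A strip of half-round molding is subjected to a constant rate of twist along its length. The nondimensional Prandtl stress function, $\psi(r, \theta)$, satisfies the Poisson problem

$$\frac{\partial^2 \psi}{\partial r^2} + \frac{1}{r}\frac{\partial \psi}{\partial r} + \frac{1}{r^2}\frac{\partial^2 \psi}{\partial \theta^2} = -1, \qquad 0 < r < 1, \quad 0 < \theta < \frac{\pi}{2},$$

$$\psi(0, \theta) = \psi(1, \theta) = \psi(r, 0) = \frac{\partial \psi}{\partial \theta}(r, \pi/2) = 0.$$

Note that symmetry has been applied to cut the domain at $\theta = \pi/2$.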


9. A rod with a cross section as indicated in Figure 9.37 is subjected to a constant rate of twist along its length. The nondimensional Prandtl stress function, $\psi(x, y)$, satisfies the Poisson problem


On2 " By? ~ 


with $\psi(x, y) = 0$ around the entire boundary.

10. A 100-mm-thick slab of granite has uniformly spaced heating pipes running through it. The pipes have a diameter of 50 mm and are centered in the thickness of the slab. The exposed surfaces of the slab are maintained at 300 K, and the liquid passing through the heating pipes maintains a temperature of 400 K at the interface between the pipes and the granite. Taking into account all symmetries, the temperature distribution within the slab satisfies

OT OT. 

oat * By? = 
over the domain indicated in Figure 9.38. The appropriate boundary conditions 
are also shown in the diagram. 


Figure 9.35 Diagram for Exercise 6. [Diagram labels: 6 meters/second at the inlet, 15 meters/second at the outlet.]

Figure 9.36 Diagram for Exercise 7. [Diagram labels: $\psi = 0$, $\partial\psi/\partial n = 0$.]

Figure 9.37 Diagram for Exercise 9.

Figure 9.38 Diagram for Exercise 10. [Diagram labels: $T = 300$ K, $\partial T/\partial n = 0$, 25 mm.]
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CHAPTER 10 


Parabolic Partial Differential 
Equations 


AN OVERVIEW 
Fundamental Mathematical Problem 


In this chapter we will develop the finite difference method to approximate solutions of equations of the form

$$\rho\frac{\partial u}{\partial t} = \frac{\partial}{\partial x}\left(\alpha\frac{\partial u}{\partial x}\right) - \beta u + s$$

in one spatial dimension, or

$$\rho\frac{\partial u}{\partial t} = \frac{\partial}{\partial x}\left(\alpha\frac{\partial u}{\partial x}\right) + \frac{\partial}{\partial y}\left(\alpha\frac{\partial u}{\partial y}\right) - \beta u + s$$

in two spatial dimensions. Here, the parameter $\rho$ represents a storage capacity, $\alpha$ measures conductivity or transmissivity, $\beta$ is a decay rate, and $s$ is a source term. In general, each parameter could be a function of any or all of the independent variables.

Equations of the type indicated above are called parabolic partial differential equations. The most basic parabolic partial differential equation is the heat equation

$$\frac{\partial u}{\partial t} = D\frac{\partial^2 u}{\partial x^2},$$

where $D$ is called the coefficient of diffusion. As we will see throughout this chapter, despite its name, the heat equation appears as a model for many different physical phenomena.

The "Rise in the Water Table due to the Spring Thaw" problem capsule from the Chapter 1 Overview (see page 8) is one application that gives rise to a parabolic partial differential equation. Here is another application.


Time-Dependent Temperature in a Fin 


Recall that, in thermodynamic terms, a fin is any surface that extends from a larger object and is intended to enhance the dissipation of heat from that object. In the Chapter 8 Overview, we developed a model for the steady-state temperature within a fin of circular cross section and variable cross-sectional area (see page 656). Here, we will develop a model for the time-dependent temperature within a fin with constant cross-sectional area.

Figure 10.1 [Diagram labels: $x = 0$, $\Delta x$.]

The geometry for our problem is depicted in Figure 10.1. The fin has length $L$ and a circular cross section with constant radius $r$. Distance along the axis of the fin will be measured by the variable $x$, with $x = 0$ corresponding to the location at which the fin is attached to the larger object. We will denote the temperature at any location $x$ and at any time $t$ by $T(x, t)$.

To proceed with the development of our model, consider the arbitrary slice of the fin indicated in the diagram. Conservation of energy (specifically thermal energy for the current problem) requires that

$$\frac{dE_{st}}{dt} = E_{in} - E_{out}, \qquad (1)$$

where $E_{st}$ is the energy stored in the slice and $E_{in}$ and $E_{out}$ are the rates at which energy enters and exits the slice, respectively. The total energy stored in the slice is given by

$$E_{st} = \rho c_p V T,$$

where $\rho$ is the mass density and $c_p$ is the specific heat of the material from which the fin is constructed and $V = \pi r^2 \Delta x$ is the volume of the slice. If we assume $\rho$ and $c_p$ are constant, then

$$\frac{dE_{st}}{dt} = \rho c_p \pi r^2 \Delta x\,\frac{\partial T}{\partial t}. \qquad (2)$$

Following the procedure described in the Chapter 8 Overview, we find that

$$E_{in} - E_{out} = \pi r^2 k\,\Delta x\,\frac{\partial^2 T}{\partial x^2} - 2\pi r\,\Delta x\,h\,(T - T_\infty), \qquad (3)$$

where $k$ is the thermal conductivity of the fin, $h$ is the convection heat transfer coefficient between the fin and the surrounding air, and $T_\infty$ is the temperature of the air. The first term in (3) accounts for conduction along the axis of the fin, while the second term accounts for convection from the lateral surface.

Substituting (2) and (3) into (1) and then dividing by $\rho c_p \pi r^2 \Delta x$ yields

$$\frac{\partial T}{\partial t} = \frac{k}{\rho c_p}\frac{\partial^2 T}{\partial x^2} - \frac{2h}{\rho c_p r}(T - T_\infty). \qquad (4)$$

We will assume that the temperature throughout the fin is initially equal to $T_\infty$ and that the temperature at $x = 0$ is known for all time; that is,

$$T(x, 0) = T_\infty \qquad\text{and}\qquad T(0, t) = f(t) \qquad (5)$$

for some function $f$. Any heat conducted to the exposed tip at $x = L$ is dissipated by convection, so

$$-k\left.\frac{\partial T}{\partial x}\right|_{x=L} = h\left(T(L, t) - T_\infty\right). \qquad (6)$$

Equations (4)-(6) comprise what is called an initial boundary value problem for determining $T(x, t)$.


Remainder of the Chapter 


We will begin our investigation of the numerical solution of parabolic partial dif- 
ferential equations with the heat equation in one dimension subject to Dirichlet 
boundary conditions. Three separate finite difference techniques will be developed. 
Section 10.2 will focus on the important issue of stability. A detailed analysis of 
the schemes developed in Section 10.1 will be presented. Parabolic problems more 
general than the heat equation will be treated in Section 10.3, and non-Dirichlet 
boundary conditions will be considered in Section 10.4. In Section 10.5, parabolic 
partial differential equations in polar coordinates will be discussed. The alternating
direction implicit (ADI) scheme for problems in two dimensions will be presented 
in the final section. 


10.1 THE HEAT EQUATION WITH DIRICHLET BOUNDARY CONDITIONS 


Consider the initial boundary value problem

$$\text{IBVP}\quad\begin{cases} \dfrac{\partial u}{\partial t} = D\dfrac{\partial^2 u}{\partial x^2}, & A < x < B,\ t > 0 \\ u(A, t) = u_A(t) \\ u(B, t) = u_B(t) \\ u(x, 0) = f(x) \end{cases} \qquad (1)$$

The diffusion coefficient, $D$, is assumed to be constant. This problem can serve as a model for heat conduction, soil consolidation, groundwater flow, and so on. Based on the geometry of the domain, as pictured in Figure 10.2, this problem is often referred to as a one-end open problem. In this section, three separate formulations of the finite difference method will be developed for equation (1).

Figure 10.2 Geometry of domain for model initial boundary value problem. [Diagram labels: $u_A(t)$ along $x = A$, $u_B(t)$ along $x = B$, $u(x, 0) = f(x)$ along $t = 0$.]


Spatial Discretization 


The discretization of (1) proceeds in two stages. First we discretize the space variable, then the time variable. To achieve the spatial discretization, let $\Delta x = (B - A)/N$ and $x_j = A + j\Delta x$ for $j = 0, 1, 2, \ldots, N$, where $N$ is some positive integer. Next, evaluate the partial differential equation at the arbitrary interior grid point $x = x_j$:

$$\left.\frac{\partial u}{\partial t}\right|_{x=x_j} = D\left.\frac{\partial^2 u}{\partial x^2}\right|_{x=x_j}.$$

Replace the derivative on the right-hand side of the equation by its second-order central difference approximation, and then drop the truncation error term. If we let $v_j(t)$ denote the semidiscrete approximation to $u(x_j, t)$, the functions $v_j(t)$ satisfy

$$\frac{dv_j(t)}{dt} = D\,\frac{v_{j-1}(t) - 2v_j(t) + v_{j+1}(t)}{(\Delta x)^2} \qquad (2)$$

for $1 \le j \le N-1$. These equations are supplemented by the boundary conditions $v_0(t) = u_A(t)$ and $v_N(t) = u_B(t)$. The initial values for the $v_j(t)$ are given by $v_j(0) = f(x_j)$.

The equations for the $v_j(t)$ can easily be written in matrix form. Let

$$\mathbf{v}(t) = [\,v_1(t)\ \ v_2(t)\ \ v_3(t)\ \cdots\ v_{N-1}(t)\,]^T,$$

$$\mathbf{f} = [\,f(x_1)\ \ f(x_2)\ \ f(x_3)\ \cdots\ f(x_{N-1})\,]^T,$$

$$\mathbf{b}(t) = [\,-u_A(t)\ \ 0\ \cdots\ 0\ \ -u_B(t)\,]^T$$

and

$$A = \begin{bmatrix} 2 & -1 & & & \\ -1 & 2 & -1 & & \\ & -1 & 2 & -1 & \\ & & \ddots & \ddots & \ddots \\ & & & -1 & 2 \end{bmatrix}.$$

With these definitions, equation (2) becomes

$$\frac{d\mathbf{v}(t)}{dt} = -\frac{D}{(\Delta x)^2}\left[A\mathbf{v}(t) + \mathbf{b}(t)\right], \qquad \mathbf{v}(0) = \mathbf{f}. \qquad (3)$$

This is a semidiscrete system of initial value problems that are an order $O((\Delta x)^2)$ approximation to the original partial differential equation.


Temporal Discretization 


There are many techniques available for solving the equations represented by (3). 
Since (3) is a system of initial value problems, any of the methods that were de- 
veloped in Chapter 7 could be used. For example, we might use the classical 
fourth-order Runge-Kutta method, a predictor/corrector technique or the RKF45 
adaptive scheme. Instead, we will focus on the use of finite difference schemes 
for solving equation (3). Even under this heading, there are several possibilities 
available, of which we will develop three. 

To complete the discretization of equation (1), we divide the time axis into uniform steps of length $\Delta t$. Let $t_n = n\Delta t$ for $n = 0, 1, 2, \ldots$. Values for the approximate solution will be obtained at these discrete time levels. For the fully discrete approximation, we will use the notation

$$w_j^{(n)} \approx v_j(t_n) \approx u(x_j, t_n).$$

Note the subscript on $w$ indicates spatial location along the grid, while the superscript indicates the time level.

For our first method, let's evaluate equation (3) at the nth time level, $t = t_n$, and use a first-order forward difference approximation for the time derivative. Alternatively, we can integrate both sides of equation (3) from $t = t_n$ to $t = t_{n+1}$ and use a left endpoint approximation for the integral on the right-hand side. Following either procedure, we arrive at

$$\frac{\mathbf{w}^{(n+1)} - \mathbf{w}^{(n)}}{\Delta t} = -\frac{D}{(\Delta x)^2}\left[A\mathbf{w}^{(n)} + \mathbf{b}^{(n)}\right], \qquad \mathbf{w}^{(0)} = \mathbf{f},$$

where

$$\mathbf{w}^{(n)} = [\,w_1^{(n)}\ \ w_2^{(n)}\ \ w_3^{(n)}\ \cdots\ w_{N-1}^{(n)}\,]^T$$

and $\mathbf{b}^{(n)} = \mathbf{b}(t_n)$. Since a first-order difference formula was used for the time derivative, this equation is first order accurate in $\Delta t$. The overall truncation error between $w_j^{(n)}$ and $u(x_j, t_n)$ is then $O((\Delta x)^2 + \Delta t)$, or first order in time and second order in space.

Solving the previous equation for $\mathbf{w}^{(n+1)}$ yields

$$\mathbf{w}^{(n+1)} = (I - \lambda A)\mathbf{w}^{(n)} - \lambda\mathbf{b}^{(n)}, \qquad (4)$$

where $\lambda = D\Delta t/(\Delta x)^2$. This is an explicit equation for computing the approximate solution at one time level from the values of the approximate solution at the previous time level. With $\mathbf{w}^{(0)}$ known from the initial conditions, we can march forward one increment of $\Delta t$ at a time. The matrix $I - \lambda A$ is called the evolution matrix for the numerical method. This explicit method for solving the model initial boundary value problem, equation (1), is known as the forward in time/central in space, or FTCS, method.
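The FTCS update is short enough to write out directly. The following is a minimal sketch rather than a reference implementation: the function name `ftcs` and its argument list are assumptions, and the update is written componentwise instead of forming the evolution matrix explicitly.

```python
import numpy as np

def ftcs(f, uA, uB, x_left, x_right, D, N, dt, nsteps):
    """March the heat equation forward nsteps time steps with FTCS.

    f is the initial profile, uA(t) and uB(t) are the Dirichlet data at
    x_left and x_right. Returns the grid and the interior values w_1..w_{N-1}.
    """
    dx = (x_right - x_left) / N
    lam = D * dt / dx ** 2
    x = x_left + dx * np.arange(N + 1)
    w = np.asarray(f(x[1:N]), dtype=float)
    for n in range(nsteps):
        t = n * dt
        # neighbors to the left and right, with boundary data at time t_n
        left = np.concatenate(([uA(t)], w[:-1]))
        right = np.concatenate((w[1:], [uB(t)]))
        # componentwise form of w_new = (I - lam*A) w - lam*b
        w = w + lam * (left - 2.0 * w + right)
    return x, w
```

As a usage example, `ftcs(lambda x: 2*np.sin(2*np.pi*x), lambda t: 0.0, lambda t: 0.0, 0.0, 1.0, 1/16, 4, 0.25, 4)` takes four steps with $\lambda = 1/4$ on a grid with three unknowns and returns interior values of 0.125, 0 and -0.125 up to rounding.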


EXAMPLE 10.1 Demonstration of the FTCS Method 


Consider the initial boundary value problem

$$\frac{\partial u}{\partial t} = \frac{1}{16}\frac{\partial^2 u}{\partial x^2},$$

$$u(0, t) = u(1, t) = 0,$$

$$u(x, 0) = 2\sin(2\pi x).$$

For this problem, $D = 1/16$, $u_A(t) = u_B(t) = 0$ and $f(x) = 2\sin(2\pi x)$. To construct our finite difference approximation, let's take $\Delta x = \Delta t = 0.25$. Then

$$\lambda = \frac{(1/16)(0.25)}{(0.25)^2} = \frac{1}{4}.$$

With this value for $\lambda$, the evolution matrix for the FTCS method is

$$I - \lambda A = \begin{bmatrix} 0.5 & 0.25 & 0 \\ 0.25 & 0.5 & 0.25 \\ 0 & 0.25 & 0.5 \end{bmatrix}.$$

Since $u_A(t) = u_B(t) = 0$, the vector $\mathbf{b}^{(n)} = \mathbf{0}$ for all $n$. Hence, equation (4) reduces to

$$\mathbf{w}^{(n+1)} = \begin{bmatrix} 0.5 & 0.25 & 0 \\ 0.25 & 0.5 & 0.25 \\ 0 & 0.25 & 0.5 \end{bmatrix}\mathbf{w}^{(n)}.$$

The spatial grid consists of the points $x_0 = 0$, $x_1 = 0.25$, $x_2 = 0.5$, $x_3 = 0.75$, and $x_4 = 1$. Our calculations therefore start from the vector

$$\mathbf{w}^{(0)} = [\,f(x_1)\ \ f(x_2)\ \ f(x_3)\,]^T = [\,2\ \ 0\ \ -2\,]^T.$$

For the first time step we compute

$$\mathbf{w}^{(1)} = \begin{bmatrix} 0.5 & 0.25 & 0 \\ 0.25 & 0.5 & 0.25 \\ 0 & 0.25 & 0.5 \end{bmatrix}\begin{bmatrix} 2 \\ 0 \\ -2 \end{bmatrix} = \begin{bmatrix} 1 \\ 0 \\ -1 \end{bmatrix}.$$

The second time step produces

$$\mathbf{w}^{(2)} = \begin{bmatrix} 0.5 & 0.25 & 0 \\ 0.25 & 0.5 & 0.25 \\ 0 & 0.25 & 0.5 \end{bmatrix}\begin{bmatrix} 1 \\ 0 \\ -1 \end{bmatrix} = \begin{bmatrix} 0.5 \\ 0 \\ -0.5 \end{bmatrix},$$

while the third and fourth yield

$$\mathbf{w}^{(3)} = \begin{bmatrix} 0.5 & 0.25 & 0 \\ 0.25 & 0.5 & 0.25 \\ 0 & 0.25 & 0.5 \end{bmatrix}\begin{bmatrix} 0.5 \\ 0 \\ -0.5 \end{bmatrix} = \begin{bmatrix} 0.25 \\ 0 \\ -0.25 \end{bmatrix}$$

and

$$\mathbf{w}^{(4)} = \begin{bmatrix} 0.5 & 0.25 & 0 \\ 0.25 & 0.5 & 0.25 \\ 0 & 0.25 & 0.5 \end{bmatrix}\begin{bmatrix} 0.25 \\ 0 \\ -0.25 \end{bmatrix} = \begin{bmatrix} 0.125 \\ 0 \\ -0.125 \end{bmatrix}.$$

The exact solution for this problem is $u(x, t) = 2e^{-(\pi^2/4)t}\sin(2\pi x)$. The following table compares the values of the approximate solution at $t = 1$, $w_j^{(4)}$, with the values of the exact solution, $u(x_j, 1) = u_j(1)$. These errors are not unreasonable given the crudeness of the discretization in both time and space.

    x_j     Approximate            Exact                Absolute Error
            Solution, w_j^(4)      Solution, u_j(1)
    0.00     0.000                  0.000000
    0.25     0.125                  0.169610            0.044610
    0.50     0.000                  0.000000            0.000000
    0.75    -0.125                 -0.169610            0.044610
    1.00     0.000                  0.000000

A numerical verification of the second order spatial accuracy of the scheme is presented in the next table. Both the maximum absolute error and the root mean square (rms) error in the approximate solution at $t = 1$, as a function of the number of subintervals, $N$, into which the spatial axis has been divided, are displayed. The rms error was computed according to the formula

$$\text{rms error} = \sqrt{\frac{1}{N}\sum_{j=1}^{N-1}\left(w_j^{(n)} - u_j(t_n)\right)^2}.$$

To guarantee that we observe the full effect on the approximation error each time we change $\Delta x$, $\Delta t$ must be chosen small enough so that the temporal truncation error is much less than the spatial truncation error. The value $\Delta t = 10^{-4}$ was used for all calculations. Note that with each doubling of $N$, the step size was cut in half, and the approximation error was reduced by roughly a factor of 4, which is what one would expect from a second-order numerical method.

    N     Maximum Absolute Error    Error Ratio    rms Error    Error Ratio
    4     0.101006                                 0.071422
    8     0.022388                  4.511733       0.015830     4.511733
    16    0.005384                  4.158237       0.003807     4.158237
    32    0.001296                  4.152887       0.000917     4.152887
    64    0.000285                  4.553531       0.000201     4.553531

Numerically verifying the first-order temporal accuracy of the FTCS method is very difficult. To guarantee that we observe the full effect of any change in $\Delta t$, we want to choose $\Delta x$ small enough so that the spatial truncation error is much less than the temporal truncation error. For the FTCS method, however, we must choose $\Delta t \le (\Delta x)^2/(2D)$. The reason for this will be discussed in the next section. Unfortunately, with the time step restricted in this manner, the temporal truncation error will always be of the same order or smaller than the spatial truncation error.
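The spatial convergence study in this example can be repeated with a short script. This is a sketch under the assumptions of the example ($D = 1/16$, homogeneous boundary values, $f(x) = 2\sin(2\pi x)$, errors measured at $t = 1$); the function name `ftcs_max_error` is invented for illustration.

```python
import numpy as np

def ftcs_max_error(N, dt=1e-4, D=1.0 / 16.0, t_final=1.0):
    """Maximum absolute FTCS error at t_final for the sample problem."""
    dx = 1.0 / N
    lam = D * dt / dx ** 2          # must satisfy lam <= 1/2 for stability
    x = dx * np.arange(N + 1)
    w = 2.0 * np.sin(2.0 * np.pi * x)
    w[0] = w[-1] = 0.0              # homogeneous Dirichlet data
    for _ in range(round(t_final / dt)):
        # the right-hand side is evaluated in full before it is stored
        w[1:-1] = w[1:-1] + lam * (w[:-2] - 2.0 * w[1:-1] + w[2:])
    exact = 2.0 * np.exp(-np.pi ** 2 * t_final / 4.0) * np.sin(2.0 * np.pi * x)
    return np.max(np.abs(w - exact))

if __name__ == "__main__":
    previous = None
    for N in (4, 8, 16, 32):
        err = ftcs_max_error(N)
        ratio = "" if previous is None else f"{previous / err:.4f}"
        print(N, f"{err:.6f}", ratio)
        previous = err
```

The printed errors can be compared against the maximum absolute error column of the table in this example; ratios near 4 are the signature of second-order accuracy in $\Delta x$.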


The backward in time/central in space, or BTCS, method is obtained by evaluating equation (3) at the (n+1)-st time level, $t = t_{n+1}$, and using a first-order backward difference approximation for the time derivative. Alternatively, we can integrate both sides of equation (3) from $t = t_n$ to $t = t_{n+1}$ and use a right endpoint approximation for the integral on the right-hand side. Following either procedure leads to

$$\frac{\mathbf{w}^{(n+1)} - \mathbf{w}^{(n)}}{\Delta t} = -\frac{D}{(\Delta x)^2}\left[A\mathbf{w}^{(n+1)} + \mathbf{b}^{(n+1)}\right], \qquad \mathbf{w}^{(0)} = \mathbf{f},$$

where

$$\mathbf{w}^{(n)} = [\,w_1^{(n)}\ \ w_2^{(n)}\ \ w_3^{(n)}\ \cdots\ w_{N-1}^{(n)}\,]^T$$

and $\mathbf{b}^{(n+1)} = \mathbf{b}(t_{n+1})$. Like the FTCS method, this scheme is first order in time and second order in space.

Rearranging terms in the evolution equation, we obtain

$$(I + \lambda A)\mathbf{w}^{(n+1)} = \mathbf{w}^{(n)} - \lambda\mathbf{b}^{(n+1)}, \qquad (5)$$

where $\lambda = D\Delta t/(\Delta x)^2$. From here, we see that the BTCS method defines $\mathbf{w}^{(n+1)}$ implicitly in terms of $\mathbf{w}^{(n)}$. Each time step therefore requires the solution of a linear system of equations. Since $I + \lambda A$ is a tridiagonal matrix, computing $\mathbf{w}^{(n+1)}$ from equation (5) requires only about twice as many algebraic operations as computing $\mathbf{w}^{(n+1)}$ from equation (4). The benefit derived from the additional effort will be explored in the next section.
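One way to carry out the tridiagonal solve in equation (5) is forward elimination followed by back substitution (often called the Thomas algorithm). The sketch below is illustrative: the names `thomas` and `btcs` and their argument conventions are assumptions, and no pivoting is performed, which is acceptable here because $I + \lambda A$ is strictly diagonally dominant.

```python
import numpy as np

def thomas(lower, diag, upper, rhs):
    """Solve a tridiagonal system; lower[i] multiplies x[i-1] in row i and
    upper[i] multiplies x[i+1] in row i (lower[0], upper[-1] are unused)."""
    n = len(diag)
    c = np.zeros(n)
    d = np.zeros(n)
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom
    x = np.zeros(n)
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x

def btcs(f, uA, uB, x_left, x_right, D, N, dt, nsteps):
    """March the heat equation forward nsteps time steps with BTCS."""
    dx = (x_right - x_left) / N
    lam = D * dt / dx ** 2
    x = x_left + dx * np.arange(N + 1)
    w = np.asarray(f(x[1:N]), dtype=float)
    n = N - 1
    lower = np.full(n, -lam)
    upper = np.full(n, -lam)
    diag = np.full(n, 1.0 + 2.0 * lam)
    for step in range(1, nsteps + 1):
        t_new = step * dt
        # right-hand side of (5): w - lam*b, with b evaluated at t_{n+1}
        rhs = w.copy()
        rhs[0] += lam * uA(t_new)
        rhs[-1] += lam * uB(t_new)
        w = thomas(lower, diag, upper, rhs)
    return x, w
```

For the sample problem with $D = 1/16$, $N = 4$ and $\Delta t = 0.25$, each step multiplies the antisymmetric profile by 2/3, so four steps starting from $[2, 0, -2]$ give $[32/81, 0, -32/81]$.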
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EXAMPLE 10.2 Demonstration of the BTCS Method 
Let’s once again consider the initial boundary value problem 
ou 1 eu 
at «16 Gx? 
u(0,t) = u(1,t) =0 
u(x, 0) = 2sin(27z). 
With Az = Az = 0.25, the method parameters become \ = 1/4 and 


16 ~0.25 0 
P+AA=} -0.25 1.5 —0.25 ]. 
0 —0.25 1.5 


Since u,(t) = ug(t) = 0, the vector b+” = 0 for all n. Hence, equation (5) 
reduces to 


15 —0,.25 0 
0 —0.25 1.5 


“095: hs “035 | wn) =n 


Starting from w) = [2 0 ~2 le in the first time step, we must solve 
15  -0.25 0 2 4/3 
-0.25 15 -0.25 |wY=]| 0 SS ae) Or Ie 
0 -025 1.5 =o hfs 
Thus, in the second time step, we are faced with the linear system 
15  -0.25 0 4/3 8/9 
0.25 15 0.25 |w2=] 0 SWS SO. he 
0 —0.25 1.5 —4/3 —8/9 
The third and fourth time steps produce the results 


16/27 32/81 
wi?) — 0 and w) = 0 ; 
-16/27 —32/81 


Comparing the values of the approximate solution at t = 1 with the exact solution, 
we see that the approximation error for the BT'CS method, for this problem, is 
roughly five times that of the FTCS method. 


Approximate Exact 
x; Solution, wi) Solution, u;(1) Absolute Error 
0.00 0.000000 0.000000 
0.25 0.395062 0.169610 0.225452 
0.50 0.000000 0.000000 0.000000 
0.75 —0.395062 —0.169610 0.225452 


1.00 0.000000 0.000000 
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The next table demonstrates the second-order spatial accuracy of the BTCS 
method. All calculations were performed with At = 5 x 1075. Note that each 
time Az was cut in half, both the maximum absolute error and the rms error were 
reduced by roughly a factor of 4. 


Az Maximum Absolute Error Error Ratio rms Error Error Ratio 


1/4 0.101088 0.071480 

1/8 (}.022467 4.499466 0.015886 4.499466 
1/16 0.005462 AALB411 0.003862 4.113411 
1/32 0.001374 3.975153 0.000972 3.975153 
1/64 0.000362 3.793634 0.000256 3.793634 


In contrast to the FTCS method, the BTCS method places no restrictions
on the size of $\Delta t$. It is therefore an easy task to verify the first-order temporal
accuracy of the scheme. With $\Delta x = 1/256$, we obtain the following results. The
first-order nature of the approximation is apparent.

$\Delta t$   Maximum Absolute Error   Error Ratio   rms Error   Error Ratio
0.008        0.004147                               0.002933
0.004        0.002085                 1.988911      0.001474    1.988911
0.002        0.001053                 1.979563      0.000745    1.979563
0.001        0.000537                 1.960657      0.000380    1.960657
0.0005       0.000279                 1.924624      0.000197    1.924624
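
An error ratio can be converted into an observed order of accuracy by taking logarithms: if the error behaves like $C(\Delta t)^p$, then $p \approx \log(e_1/e_2)/\log(\Delta t_1/\Delta t_2)$. The short sketch below applies this to the maximum absolute errors tabulated above; the function name `observed_orders` is a hypothetical helper introduced only for this illustration.

```python
import math

# Time steps and maximum absolute errors copied from the BTCS table above.
dts = [0.008, 0.004, 0.002, 0.001, 0.0005]
max_errors = [0.004147, 0.002085, 0.001053, 0.000537, 0.000279]

def observed_orders(steps, errors):
    """Estimate p in error ~ C * step**p from each consecutive pair of runs."""
    orders = []
    for i in range(len(steps) - 1):
        ratio = errors[i] / errors[i + 1]
        refinement = steps[i] / steps[i + 1]
        orders.append(math.log(ratio) / math.log(refinement))
    return orders

orders = observed_orders(dts, max_errors)
print([round(p, 3) for p in orders])
```

Values of $p$ close to 1 are consistent with first-order accuracy in time; the same helper applied to a table in which halving the step divides the error by about 4 would return values close to 2.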


The third, and final, method that we will develop is the Crank-Nicolson
scheme. Integrate both sides of equation (3) from $t = t_n$ to $t = t_{n+1}$, and
approximate the integral on the right-hand side with the trapezoidal rule. This yields

$$\mathbf{w}^{(n+1)} - \mathbf{w}^{(n)} = -\frac{D\,\Delta t}{2(\Delta x)^2}\left[\left(A\mathbf{w}^{(n)} + \mathbf{b}^{(n)}\right) + \left(A\mathbf{w}^{(n+1)} + \mathbf{b}^{(n+1)}\right)\right],$$

or

$$(I + \lambda A)\,\mathbf{w}^{(n+1)} = (I - \lambda A)\,\mathbf{w}^{(n)} - \lambda\left(\mathbf{b}^{(n)} + \mathbf{b}^{(n+1)}\right), \qquad (6)$$

where $\lambda = D\,\Delta t/[2(\Delta x)^2]$. This is another implicit method, requiring the solution
of a tridiagonal linear system at each time step. The computational cost associated
with implementing equation (6) is roughly 50% larger than that associated
with equation (5). However, since the trapezoidal rule is a second-order numerical
integration scheme, the overall truncation error for the Crank-Nicolson method is
$O((\Delta x)^2 + (\Delta t)^2)$, that is, second order in both time and space.


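EXAMPLE 10.3 Demonstration of the Crank-Nicolson Scheme

We once again turn to our sample initial boundary value problem

$$\frac{\partial u}{\partial t} = \frac{1}{16}\frac{\partial^2 u}{\partial x^2}$$
$$u(0,t) = u(1,t) = 0$$
$$u(x,0) = 2\sin(2\pi x).$$

With $\Delta x = \Delta t = 0.25$,

$$\lambda = \frac{(1/16)(1/4)}{2(1/4)^2} = \frac{1}{8},$$

so

$$I + \lambda A = \begin{bmatrix} 1.25 & -0.125 & 0 \\ -0.125 & 1.25 & -0.125 \\ 0 & -0.125 & 1.25 \end{bmatrix}$$

and

$$I - \lambda A = \begin{bmatrix} 0.75 & 0.125 & 0 \\ 0.125 & 0.75 & 0.125 \\ 0 & 0.125 & 0.75 \end{bmatrix}.$$

The evolution equation for the approximate solution is then

$$\begin{bmatrix} 1.25 & -0.125 & 0 \\ -0.125 & 1.25 & -0.125 \\ 0 & -0.125 & 1.25 \end{bmatrix} \mathbf{w}^{(n+1)} = \begin{bmatrix} 0.75 & 0.125 & 0 \\ 0.125 & 0.75 & 0.125 \\ 0 & 0.125 & 0.75 \end{bmatrix} \mathbf{w}^{(n)},$$

since both $\mathbf{b}^{(n)}$ and $\mathbf{b}^{(n+1)}$ are zero for all $n$.
Starting from the initial condition $\mathbf{w}^{(0)} = [\, 2 \;\; 0 \;\; -2 \,]^T$, in the first time
step, we must solve the system

$$\begin{bmatrix} 1.25 & -0.125 & 0 \\ -0.125 & 1.25 & -0.125 \\ 0 & -0.125 & 1.25 \end{bmatrix} \mathbf{w}^{(1)} = \begin{bmatrix} 0.75 & 0.125 & 0 \\ 0.125 & 0.75 & 0.125 \\ 0 & 0.125 & 0.75 \end{bmatrix} \begin{bmatrix} 2 \\ 0 \\ -2 \end{bmatrix} = \begin{bmatrix} 1.5 \\ 0 \\ -1.5 \end{bmatrix},$$

which yields $\mathbf{w}^{(1)} = [\, 1.2 \;\; 0 \;\; -1.2 \,]^T$. The next three time steps produce

$$\mathbf{w}^{(2)} = [\, 0.72 \;\; 0 \;\; -0.72 \,]^T,$$
$$\mathbf{w}^{(3)} = [\, 0.432 \;\; 0 \;\; -0.432 \,]^T, \quad\text{and}$$
$$\mathbf{w}^{(4)} = [\, 0.2592 \;\; 0 \;\; -0.2592 \,]^T.$$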
Comparing the values of the approximate solution at $t = 1$ with the exact solution,
we see that the approximation error for the Crank-Nicolson scheme, for this
problem, is roughly twice that of the FTCS method.

x_j     Approximate Solution, w_j^(4)   Exact Solution, u_j(1)   Absolute Error
0.00     0.000000                        0.000000
0.25     0.259200                        0.169610                0.089590
0.50     0.000000                        0.000000                0.000000
0.75    -0.259200                       -0.169610                0.089590
1.00     0.000000                        0.000000
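
The Crank-Nicolson steps of this example can be checked with a few lines of code. This is a sketch under stated assumptions: homogeneous Dirichlet data (so both boundary vectors vanish), a constant-coefficient Thomas solver, and the hypothetical helper names `solve_tridiagonal` and `crank_nicolson_step`.

```python
def solve_tridiagonal(sub, diag, sup, rhs):
    """Thomas algorithm for constant sub-, main, and superdiagonal entries."""
    n = len(rhs)
    c, d = [0.0] * n, [0.0] * n
    c[0], d[0] = sup / diag, rhs[0] / diag
    for i in range(1, n):                      # forward elimination
        denom = diag - sub * c[i - 1]
        c[i] = sup / denom
        d[i] = (rhs[i] - sub * d[i - 1]) / denom
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):             # back substitution
        x[i] = d[i] - c[i] * x[i + 1]
    return x

def crank_nicolson_step(w, lam):
    """One step of (I + lam*A) w_new = (I - lam*A) w_old with zero boundary data."""
    n = len(w)
    rhs = []
    for j in range(n):
        left = w[j - 1] if j > 0 else 0.0
        right = w[j + 1] if j < n - 1 else 0.0
        # Row j of (I - lam*A): lam, 1 - 2*lam, lam
        rhs.append((1.0 - 2.0 * lam) * w[j] + lam * (left + right))
    return solve_tridiagonal(-lam, 1.0 + 2.0 * lam, -lam, rhs)

lam = (1.0 / 16.0) * 0.25 / (2.0 * 0.25 ** 2)  # D*dt / (2*dx^2) = 1/8
w = [2.0, 0.0, -2.0]                           # w^(0)
history = [w]
for _ in range(4):
    w = crank_nicolson_step(w, lam)
    history.append(w)

for n, wn in enumerate(history):
    print(n, wn)
```

Up to rounding error, the first component is multiplied by 0.6 at every step, which is the ratio $(1 - 2\lambda)/(1 + 2\lambda)$ for this particular initial vector.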
The next table presents a numerical verification of the second-order spatial
accuracy of the scheme. For all calculations, $\Delta t = 5 \times 10^{-3}$ was used. Note that for
each $\Delta x$, the errors achieved by the Crank-Nicolson scheme are comparable with
those of both the FTCS and the BTCS method, even with a time step that is two
orders of magnitude larger.

$\Delta x$   Maximum Absolute Error   Error Ratio   rms Error   Error Ratio
1/4          0.101060                               0.071460
1/8          0.022439                 4.503678      0.015867    4.503678
1/16         0.005435                 4.128699      0.003843    4.128699
1/32         0.001347                 4.034036      0.000953    4.034036
1/64         0.000336                 4.015649      0.000237    4.015649


The Crank-Nicolson scheme also places no restrictions on the size of $\Delta t$. With
$\Delta x = 1/1000$, we obtain the following results, which clearly verify the second-order
temporal accuracy of the method.

$\Delta t$   Maximum Absolute Error   Error Ratio   rms Error   Error Ratio
1/5          0.008591                               0.006074
1/10         0.002128                 4.037015      0.001505    4.037015
1/20         0.000530                 4.016473      0.000375    4.016473
1/40         0.000131                 4.033626      0.000093    4.033626
1/80         0.000032                 4.130451      0.000022    4.130451


Application Problem: Rise in the Water Table due to the Spring Thaw 


In the Chapter 1 Overview (see page 8), we developed the initial boundary value
problem

$$\frac{\partial h}{\partial t} = \alpha\frac{\partial^2 h}{\partial x^2},$$

$$h(x,0) = h_0(x), \quad h(0,t) = h_L(t) \quad\text{and}\quad h(800,t) = h_R(t)$$

for determining the water table, $h(x,t)$, in an aquifer. The aquifer is situated
between two monitoring wells located 800 meters apart. The constant $\alpha$ is called
the hydraulic diffusivity, which has been experimentally determined to be

$$\alpha = 0.0059\ \text{m}^2/\text{s} = 509.76\ \text{m}^2/\text{day}.$$


Suppose that during the spring thaw measurements made at the well at the 
left edge of the aquifer indicate 


Deer?) £80 
tel { i ” P40, 


while the measurements made at the other well indicate 


$$h_R(t) = 0.3\,h_L(t).$$
[Figure 10.3: Change in the water table in an aquifer during the spring thaw. Horizontal axis: distance from monitoring well at left edge of aquifer, meters (0 to 800). Vertical axis: change in water table, meters.]

Let’s take $h_0(x) = 0$ so we can determine the change in the water table due to the
spring thaw. Figure 10.3 displays the resulting water table profiles after 15 days, 30
days, 45 days, and 60 days. All solutions were obtained using the Crank-Nicolson
scheme with $\Delta x = 20$ meters and $\Delta t = 0.2$ days.


EXERCISES 


In Exercises 1-4, numerically verify that

(a) the FTCS method is second-order accurate in space;
(b) the BTCS method is first-order accurate in time and second-order accurate in
space; and
(c) the Crank-Nicolson scheme is second-order accurate in both time and space
by approximating the solution of
$$\frac{\partial u}{\partial t} = \frac{\partial^2 u}{\partial x^2}$$
subject to the indicated initial and boundary conditions.
1. u(0,t)=1, ulijt}=0, u(x,0) =1-—2—- 4sin(2rz) 
exact solution: u(x,t) =1—a—2e7*" 'sin(2rz) 
2. $u(0,t) = u(\pi,t) = 0$, $u(x,0) = \sin x$
exact solution: $u(x,t) = e^{-t}\sin x$
3. $u(0,t) = u(\pi,t) = 0$, $u(x,0) = \sin^3 x$
exact solution: $u(x,t) = \frac{3}{4}e^{-t}\sin x - \frac{1}{4}e^{-9t}\sin(3x)$
4. $u(0,t) = u(\pi,t) = \frac{1}{2}(1 + e^{-4t})$, $u(x,0) = \cos^2 x$
exact solution: $u(x,t) = \frac{1}{2} + \frac{1}{2}e^{-4t}\cos(2x)$


5. Let Ax = 1/20. Approximate the solution of 


8 de 
ae u(0,t)=1, ufl,t)=0, ulz,0)=1—2—- ; Sin(2mz) 


at $t = 1$ using the FTCS method with the indicated number of time steps. In each
case calculate the corresponding value of $\lambda$ and plot the resulting approximate
solution.


(a) 800 time steps 
(b) 777 time steps 
(c) 776 time steps 
(d) 775 time steps 


6. Plot $u(x,t)$ for $t = 0.05$, $t = 0.1$, $t = 0.25$, and $t = 1$, where $u(x,t)$ is the solution
to the initial boundary value problem


Ou Ou 


ae eae u(0,f)=1, u(l,t)=5, ulz,0)=0. 


Take Ax = 1/20 for all calculations. 


7. Plot $u(x,t)$ for $t = 1$, $t = 2$, $t = 3$, and $t = 4$, where $u(x,t)$ is the solution to the
initial boundary value problem


2 
2 = at u(0,t)=1l-e7', u(l,t}=1—cos(rt), u(x, 0) =0. 


Take Az = 1/20 for all calculations. 


8. Consider flow between two flat parallel walls separated by a distance $h$. The fluid
and both walls are initially at rest. At $t = 0$, flow is initiated by impulsively
bringing the lower wall, corresponding to $y = 0$, to the constant velocity $u_0$. The
velocity profile between the walls, $u(y,t)$, satisfies

$$\frac{\partial u}{\partial t} = \nu\frac{\partial^2 u}{\partial y^2}, \quad u(y,0) = 0, \quad u(0,t) = u_0, \quad u(h,t) = 0,$$

where $\nu$ is the kinematic viscosity of the fluid. If we introduce the nondimensional
variables

$$U = u/u_0, \quad Y = y/h, \quad\text{and}\quad T = t\nu/h^2,$$

the problem becomes

$$\frac{\partial U}{\partial T} = \frac{\partial^2 U}{\partial Y^2}, \quad U(Y,0) = 0, \quad U(0,T) = 1, \quad U(1,T) = 0.$$

Plot $U(Y,T)$ for $T = 0.05$, $T = 0.1$, $T = 0.25$, and $T = 1$.


9. Soil consolidation is the hydrodynamic process by which water is expelled from
saturated soil voids when the soil is compacted. If the layer underneath the
compressible soil has a higher permeability than the soil, then as the water is
expelled, the pore water pressure, $\phi(z,t)$, satisfies

$$\frac{\partial \phi}{\partial t} = C_v\frac{\partial^2 \phi}{\partial z^2}, \quad \phi(z,0) = \Delta\phi, \quad \phi(-H_d,t) = \phi(H_d,t) = 0,$$

where $C_v$ is the coefficient of consolidation, $H_d$ is the maximum drainage path,
and $\Delta\phi$ is the change in pressure due to the compacting force. Introducing the
nondimensional variables

$$\Phi = \phi/\Delta\phi, \quad Z = z/H_d, \quad\text{and}\quad T = tC_v/H_d^2,$$

the problem becomes

$$\frac{\partial \Phi}{\partial T} = \frac{\partial^2 \Phi}{\partial Z^2}, \quad \Phi(Z,0) = 1, \quad \Phi(-1,T) = \Phi(1,T) = 0.$$

Plot $\Phi(Z,T)$ for $T = 0.05$, $T = 0.1$, $T = 0.25$, and $T = 1$.

10. An aquifer is located between two rivers, and fluctuations in the water table are
monitored at two wells located 1100 meters apart. During a flood, the rise in the
water table as measured at both wells was found to be

$$r(t) = \begin{cases} (5/3)\,t, & t \le 3 \\ 5e^{-(t-3)/5}, & t > 3, \end{cases}$$

where $r$ is measured in meters, and $t$ is measured in days. The change in the
water table, $h(x,t)$, as a result of the flood is modeled by the initial boundary
value problem

$$\frac{\partial h}{\partial t} = \alpha\frac{\partial^2 h}{\partial x^2}, \quad h(x,0) = 0, \quad h(0,t) = h(1100,t) = r(t).$$

The hydraulic diffusivity of the soil has been experimentally determined to be

$$\alpha = 0.0059\ \text{m}^2/\text{s} = 509.76\ \text{m}^2/\text{day}.$$

(a) Determine $h(x,t)$ at the peak of the flood, $t = 3$.

(b) Plot $h(x,t)$ for $t = 10$, $t = 15$, and $t = 20$.

11. (a) Assuming that $u$ is sufficiently differentiable, show that the truncation error
for the FTCS method is

$$\frac{D^2\,\Delta t}{2}\left(1 - \frac{1}{6\lambda}\right)\frac{\partial^4 u}{\partial x^4} + O((\Delta t)^2 + (\Delta x)^4),$$

where $\lambda = D\,\Delta t/(\Delta x)^2$.
(b) Show that for $\lambda = \frac{1}{6}$ the leading term in the truncation error for the FTCS
method is $O((\Delta t)^2 + (\Delta x)^4)$.
(c) Numerically verify that the FTCS method is fourth order in space when $\lambda = \frac{1}{6}$
by approximating the solution of the initial boundary value problem from
Exercise 5.
10.2 ABSOLUTE STABILITY


When approximating the solution of an elliptic partial differential equation by the 
finite difference method, our main concern was finding an efficient procedure for 
solving the discrete system of equations. When approximating the solution of a 
parabolic partial differential equation, however, our main analysis issue will be 
the absolute stability of the time discretization scheme applied to the semidiscrete 
system of equations [e.g., equation (3) from Section 10.1]. Recall from Section 7.9 
that a time discretization scheme is said to be absolutely stable when the asymptotic 
character of the solution it produces matches that of the analytical solution. For 
the differential equations treated in this chapter, this definition translates to the 
requirement that the component of the approximate solution attributable to the 
initial conditions should not increase as time steps are computed. 

We will suppose the mesh spacing $\Delta x$ has been selected small enough to
resolve any spatial variations in the solution. Once this has been done, our objective
will be to determine what restrictions, if any, need to be placed upon the choice
of $\Delta t$ to ensure absolute stability. The ideal situation would be that no restrictions
were needed, so $\Delta t$ could be chosen solely on the basis of accuracy considerations.
Methods for which no restrictions are imposed on the choice of $\Delta t$ are said to be
unconditionally stable. Methods for which an upper bound is imposed upon the
value of $\Delta t$ are called conditionally stable. An unconditionally unstable method is
one for which no value of $\Delta t$ will maintain absolute stability.

There are three basic approaches to performing stability analysis. Suppose 
the time discretization scheme that has been applied to the semidiscrete system can 
be identified as one of the methods developed in Chapter 7. We can then examine 
the region of absolute stability for that method, together with the eigenvalues of 
the coefficient matrix for the semidiscrete system, to determine whether or not $\Delta t$
needs to be restricted. Alternatively, we can work from the fully discrete system 
of equations and perform either a matrix stability analysis or a von Neumann 
stability analysis. Since we have been focusing on expressing each method in matrix 
form, we will develop the matrix approach first. The von Neumann approach will 
be presented toward the end of the section. Stability analysis based on methods 
developed in Chapter 7 will be considered in the exercises. 


Matrix Stability Analysis 


In principle, each of the methods we have developed—the FTCS method, the BTCS 
method, and the Crank-Nicolson method—can be written in the form 


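$$\mathbf{w}^{(n+1)} = E\mathbf{w}^{(n)} + \mathbf{c}^{(n)},$$

for some evolution matrix $E$. The vector $\mathbf{c}^{(n)}$ incorporates the boundary conditions.
Let $\mathbf{w}^{(0)}$ denote the vector obtained from the initial conditions. Then

$$\mathbf{w}^{(1)} = E\mathbf{w}^{(0)} + \mathbf{c}^{(0)}$$
$$\mathbf{w}^{(2)} = E\mathbf{w}^{(1)} + \mathbf{c}^{(1)} = E^2\mathbf{w}^{(0)} + \left(\mathbf{c}^{(1)} + E\mathbf{c}^{(0)}\right)$$
$$\mathbf{w}^{(3)} = E\mathbf{w}^{(2)} + \mathbf{c}^{(2)} = E^3\mathbf{w}^{(0)} + \left(\mathbf{c}^{(2)} + E\mathbf{c}^{(1)} + E^2\mathbf{c}^{(0)}\right)$$
$$\vdots$$
$$\mathbf{w}^{(n)} = E\mathbf{w}^{(n-1)} + \mathbf{c}^{(n-1)} = E^n\mathbf{w}^{(0)} + \hat{\mathbf{c}},$$

where $\hat{\mathbf{c}} = \mathbf{c}^{(n-1)} + E\mathbf{c}^{(n-2)} + E^2\mathbf{c}^{(n-3)} + \cdots + E^{n-1}\mathbf{c}^{(0)}$.

For absolute stability, $\|E^n\mathbf{w}^{(0)}\|$ must be less than or equal to $\|\mathbf{w}^{(0)}\|$ for
any initial vector $\mathbf{w}^{(0)}$. From here, it follows that we need $\|E^n\| \le 1$ for the
corresponding natural matrix norm of $E^n$. Using the fact that $\rho(E^n) \le \|E^n\|$ and
that $\rho(E^n) = [\rho(E)]^n$, we can conclude that $\rho(E) \le 1$ is a necessary condition for
absolute stability. Hence, matrix stability analysis requires an examination of the
eigenvalues of the evolution matrix.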

The following special tridiagonal matrix will play a central role in the analysis 
of the FTCS method, the BTCS method, and the Crank-Nicolson scheme: 


$$M = \begin{bmatrix} a & c \\ b & a & c \\ & b & a & c \\ & & \ddots & \ddots & \ddots \\ & & & b & a & c \\ & & & & b & a \end{bmatrix}.$$

This matrix is defined by just three numbers: the value $a$ that appears at each
location along the main diagonal, the value $b$ that appears along the principal
subdiagonal, and the value $c$ that appears along the principal superdiagonal. Assuming
that $M$ is an $N \times N$ matrix, its eigenvalues are given by

$$\mu_k = a + 2\sqrt{bc}\,\cos\frac{k\pi}{N+1} \qquad (1)$$

for $k = 1, 2, 3, \ldots, N$.
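
Equation (1) can be spot-checked numerically. The sketch below relies on a standard fact that is not stated in the text and is therefore an assumption of this illustration: when $b$ and $c$ are both positive, the vector with components $(b/c)^{j/2}\sin(jk\pi/(N+1))$, $j = 1, \ldots, N$, is an eigenvector belonging to $\mu_k$. The helper names and the sample values of $a$, $b$, $c$, and $N$ are hypothetical.

```python
import math

def eigenpair(a, b, c, N, k):
    """Eigenvalue from equation (1) and a matching eigenvector (assumes b, c > 0)."""
    theta = k * math.pi / (N + 1)
    mu = a + 2.0 * math.sqrt(b * c) * math.cos(theta)
    r = math.sqrt(b / c)
    v = [r ** j * math.sin(j * theta) for j in range(1, N + 1)]
    return mu, v

def residual(a, b, c, mu, v):
    """Largest entry of |M v - mu v| for M with b below and c above the diagonal."""
    N = len(v)
    worst = 0.0
    for j in range(N):
        below = v[j - 1] if j > 0 else 0.0      # multiplies the subdiagonal entry b
        above = v[j + 1] if j < N - 1 else 0.0  # multiplies the superdiagonal entry c
        worst = max(worst, abs(b * below + a * v[j] + c * above - mu * v[j]))
    return worst

a, b, c, N = 2.0, 1.0, 4.0, 6   # sample entries chosen only for this check
worst_residual = max(
    residual(a, b, c, *eigenpair(a, b, c, N, k)) for k in range(1, N + 1)
)
print(worst_residual)
```

A residual at the level of rounding error for every $k$ indicates that each $\mu_k$ from equation (1) is indeed an eigenvalue of the sample matrix.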


Forward in Time 
The evolution matrix for the FTCS method is

$$E_{FTCS} = I - \lambda A = \begin{bmatrix} 1-2\lambda & \lambda \\ \lambda & 1-2\lambda & \lambda \\ & \lambda & 1-2\lambda & \lambda \\ & & \ddots & \ddots & \ddots \\ & & & \lambda & 1-2\lambda & \lambda \\ & & & & \lambda & 1-2\lambda \end{bmatrix},$$

where $\lambda = D\,\Delta t/(\Delta x)^2$. This matrix matches the pattern of the special matrix $M$
with

$$a = 1 - 2\lambda \quad\text{and}\quad b = c = \lambda.$$

Because $E_{FTCS}$ is an $(N-1) \times (N-1)$ matrix, the denominator $N + 1$ in equation (1)
becomes $N$. It follows from equation (1), then, that the eigenvalues of $E_{FTCS}$ are given by

$$\begin{aligned} \mu_k &= 1 - 2\lambda + 2\lambda\cos\frac{k\pi}{N} \\ &= 1 - 2\lambda\left(1 - \cos\frac{k\pi}{N}\right) \\ &= 1 - 4\lambda\sin^2\frac{k\pi}{2N} \end{aligned} \qquad (2)$$

for $k = 1, 2, 3, \ldots, N-1$. In going from the second line to the third, the half-angle
formula for $\sin^2\theta$ has been used.

The FTCS method will be stable provided $\rho(E_{FTCS}) \le 1$. From equation (2),
this will happen when the compound inequality

$$-1 \le 1 - 4\lambda\sin^2\frac{k\pi}{2N} \le 1$$

holds for all $k$. Since $\lambda$ is strictly positive by construction, the inequality on the
right side is trivially satisfied. For the inequality on the left, note that the sine
function has a maximum value of 1, so the quantity between the inequalities has a
minimum value of $1 - 4\lambda$. Requiring $1 - 4\lambda$ to be greater than or equal to $-1$ leads
to $\lambda \le 1/2$. Hence, the FTCS method is conditionally stable. Once $\Delta x$ has been
selected, we must choose

$$\Delta t \le \frac{(\Delta x)^2}{2D}$$

to maintain absolute stability.


EXAMPLE 10.4 The Conditional Stability of the FTCS Method 


Let’s once again examine the initial boundary value problem

$$\frac{\partial u}{\partial t} = \frac{1}{16}\frac{\partial^2 u}{\partial x^2}, \quad u(0,t) = u(1,t) = 0, \quad u(x,0) = 2\sin(2\pi x).$$

The problem fixes the value of $D$ at 1/16. Suppose we choose $\Delta x = 1/40$. Stability
then requires that $\Delta t$ be chosen to satisfy

$$\Delta t \le \frac{(\Delta x)^2}{2D} = \frac{1/1600}{2(1/16)} = \frac{1}{200}.$$


Figure 10.4 displays the approximate solution to the test problem at $t = 1$,
computed using the FTCS method with different values of $\Delta t$, and therefore with
different values of $\lambda$. The solution in the top graph was computed with $\Delta t = 1/200$,
while the solution in the middle graph was computed with $\Delta t = 1/182$. The
solution in the bottom graph was computed with $\Delta t = 1/180$. These time step
values correspond to $\lambda = 1/2$, $\lambda \approx 0.549$ and $\lambda = 5/9$, respectively.

[Figure 10.4: Approximate solutions to the initial-boundary value problem $\partial u/\partial t = (1/16)\,\partial^2 u/\partial x^2$, $u(0,t) = u(1,t) = 0$, $u(x,0) = 2\sin(2\pi x)$, computed using the FTCS method. Solutions at $t = 1$ are shown. Panels, top to bottom, all with $\Delta x = 1/40$: $\Delta t = 1/200$ ($\lambda = 0.5$); $\Delta t = 1/182$; $\Delta t = 1/180$ ($\lambda = 5/9$). Axes: $u(x,1)$ versus $x$.]

With the cutoff value of $\Delta t = 1/200$ ($\lambda = 1/2$), the computed approximate
solution behaves as expected. The approximate solution compares favorably with
the exact solution, $u(x,1) = 2e^{-\pi^2/4}\sin(2\pi x)$, and there are no signs of instability.
Increasing the time step by less than 10%, to $\Delta t = 1/182$ ($\lambda \approx 0.549$), the solution
begins to show the tell-tale signs of instability. Note the sawtooth-like oscillations
in the middle graph. If additional time steps were performed with this value of $\Delta t$,
the quality of the solution would degrade rapidly. Finally, taking $\Delta t = 1/180$
($\lambda = 5/9$), we observe full-blown instability. (Compare the middle and bottom
graphs of Figure 10.4 with the bottom graph of Figure 7.21.)

For $1/200 < \Delta t < 1/182$, the solution curves at $t = 1$ do not appear to indicate
that the FTCS method is unstable, even though these values correspond to $\lambda$ values
which are larger than 0.5. This, however, is not a contradiction of our matrix
analysis. With time step values in the range $1/200 < \Delta t < 1/182$, the amplification
due to repeated multiplication by $E_{FTCS}$ is small enough so that instability is not
yet observed at $t = 1$. Were the solutions to be advanced beyond $t = 1$, eventually,
the instability would become apparent. This will be explored in Exercise 1.
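
The delayed onset of visible instability can be made quantitative with the eigenvalue formula for $E_{FTCS}$. The sketch below computes the spectral radius $\rho$ and the worst-case growth factor $\rho^n$ accumulated over the $n$ steps needed to reach $t = 1$. Interpreting this factor as acting on perturbations of roughly machine-precision size (about $10^{-16}$ in double precision) is an assumption of this illustration, as is the helper name `ftcs_spectral_radius`.

```python
import math

def ftcs_spectral_radius(lam, N):
    """max over k = 1..N-1 of |1 - 4*lam*sin^2(k*pi/(2N))|."""
    return max(
        abs(1.0 - 4.0 * lam * math.sin(k * math.pi / (2 * N)) ** 2)
        for k in range(1, N)
    )

D, N = 1.0 / 16.0, 40
dx = 1.0 / N
report = {}
for steps in (200, 190, 182, 180):      # number of time steps needed to reach t = 1
    dt = 1.0 / steps
    lam = D * dt / dx ** 2
    rho = ftcs_spectral_radius(lam, N)
    report[steps] = (lam, rho, rho ** steps)

for steps, (lam, rho, growth) in report.items():
    print(f"dt = 1/{steps}: lambda = {lam:.4f}, rho = {rho:.5f}, rho^n = {growth:.3e}")
```

Under these assumptions the growth factor at $t = 1$ is roughly $10^8$ for $\Delta t = 1/190$, $10^{14}$ for $\Delta t = 1/182$, and $10^{15}$ for $\Delta t = 1/180$. A rounding-level perturbation amplified by $10^8$ is still far too small to see in a plot, whereas amplification by $10^{14}$ or more brings it to a visible size.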
Backward in Time

The evolution matrix for the BTCS method is

$$E_{BTCS} = (I + \lambda A)^{-1} = \begin{bmatrix} 1+2\lambda & -\lambda \\ -\lambda & 1+2\lambda & -\lambda \\ & -\lambda & 1+2\lambda & -\lambda \\ & & \ddots & \ddots & \ddots \\ & & & -\lambda & 1+2\lambda & -\lambda \\ & & & & -\lambda & 1+2\lambda \end{bmatrix}^{-1},$$

where $\lambda = D\,\Delta t/(\Delta x)^2$. The matrix $I + \lambda A$ matches the pattern of the special
matrix $M$ with

$$a = 1 + 2\lambda \quad\text{and}\quad b = c = -\lambda.$$

It follows from equation (1), then, that the eigenvalues of $I + \lambda A$ are given by

$$\begin{aligned} \mu_k &= 1 + 2\lambda + 2\lambda\cos\frac{k\pi}{N} \\ &= 1 + 2\lambda\left(1 + \cos\frac{k\pi}{N}\right) \\ &= 1 + 4\lambda\cos^2\frac{k\pi}{2N} \end{aligned} \qquad (3)$$

for $k = 1, 2, 3, \ldots, N-1$.

From equation (3), we see that $\mu_k > 1$ for all $k$. Let $\tau_k$ denote the eigenvalues
of $E_{BTCS}$. Since $E_{BTCS} = (I + \lambda A)^{-1}$, $\tau_k = 1/\mu_k$. With $\mu_k > 1$ for all $k$, it follows
that $0 < \tau_k < 1$ for all $k$. Hence $\rho(E_{BTCS}) < 1$, regardless of the value of $\lambda$, and
the BTCS method is unconditionally stable. This is the payoff we get in return for
the extra work of solving a linear system of equations at each time step.


EXAMPLE 10.5 Unconditional Stability of the BTCS Method 


Figure 10.5 displays approximate solutions to the initial boundary value problem

$$\frac{\partial u}{\partial t} = \frac{1}{16}\frac{\partial^2 u}{\partial x^2}, \quad u(0,t) = u(1,t) = 0, \quad u(x,0) = 2\sin(2\pi x),$$

computed using the BTCS method with $\Delta x = 1/40$ and $\Delta t = 1/200$ (top graph),
$\Delta t = 1/100$ (middle graph), and $\Delta t = 1/10$ (bottom graph). The corresponding
values of $\lambda$ are $\lambda = 1/2$, $\lambda = 1$, and $\lambda = 10$. Even with $\lambda$ as large as 10, the
computed solution shows no signs of instability.

[Figure 10.5: Approximate solutions to the initial boundary value problem $\partial u/\partial t = (1/16)\,\partial^2 u/\partial x^2$, $u(0,t) = u(1,t) = 0$, $u(x,0) = 2\sin(2\pi x)$, computed using the BTCS method. Solutions at $t = 1$ are shown. Panels, top to bottom, all with $\Delta x = 1/40$: $\Delta t = 1/200$ ($\lambda = 0.5$); $\Delta t = 1/100$; $\Delta t = 1/10$ ($\lambda = 10.0$). Horizontal axis: $x$.]


Crank-Nicolson Scheme 


The evolution matrix for the Crank-Nicolson scheme is given by

$$E_{CN} = (I + \lambda A)^{-1}(I - \lambda A) = E_{BTCS}E_{FTCS},$$

where $\lambda = D\,\Delta t/[2(\Delta x)^2]$. To determine the eigenvalues of $E_{CN}$, note that

$$(I + \lambda A) + (I - \lambda A) = 2I.$$

Premultiplying this equation by $(I + \lambda A)^{-1}$ yields

$$I + (I + \lambda A)^{-1}(I - \lambda A) = 2(I + \lambda A)^{-1},$$

or

$$E_{CN} = -I + 2(I + \lambda A)^{-1}. \qquad (4)$$

If $\tau_k$ are the eigenvalues of $E_{CN}$ and $\mu_k$ are the eigenvalues of $I + \lambda A$, then
equation (4) implies that

$$\tau_k = -1 + 2/\mu_k.$$

From our work with the BTCS method, we already know that $\mu_k > 1$ for
all $k$. This relation implies that, for all $k$,

$$0 < \frac{1}{\mu_k} < 1,$$

and then

$$0 < \frac{2}{\mu_k} < 2.$$

Finally,

$$-1 < -1 + \frac{2}{\mu_k} < 1.$$

Hence $\rho(E_{CN}) < 1$, regardless of the value of $\lambda$. The Crank-Nicolson scheme,
like the BTCS method, is unconditionally stable. With unconditional stability and
second-order accuracy in both time and space, the Crank-Nicolson scheme is
considered the method of choice for one-dimensional diffusion problems.
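
Both unconditional stability results can be probed numerically from the eigenvalues of $I + \lambda A$. In the sketch below, `lam` is treated as a generic positive parameter (the two schemes define $\lambda$ differently, so a given $\Delta t$ corresponds to different values of `lam`), and the function name `implicit_factors` is a hypothetical helper.

```python
import math

def implicit_factors(lam, N):
    """Eigenvalues of E_BTCS (1/mu_k) and E_CN (-1 + 2/mu_k) for k = 1..N-1."""
    btcs, cn = [], []
    for k in range(1, N):
        mu = 1.0 + 4.0 * lam * math.cos(k * math.pi / (2 * N)) ** 2
        btcs.append(1.0 / mu)
        cn.append(-1.0 + 2.0 / mu)
    return btcs, cn

N = 40
radii = {}
for lam in (0.5, 1.0, 10.0, 1000.0):
    btcs, cn = implicit_factors(lam, N)
    radii[lam] = (max(abs(t) for t in btcs), max(abs(t) for t in cn))

for lam, (rho_btcs, rho_cn) in radii.items():
    print(f"lam = {lam:7.1f}: rho(E_BTCS) = {rho_btcs:.6f}, rho(E_CN) = {rho_cn:.6f}")
```

For every sampled value both spectral radii stay below 1. As `lam` grows, the Crank-Nicolson eigenvalues associated with the highest-frequency modes approach $-1$, so those modes are damped only weakly, whereas the corresponding BTCS eigenvalues approach 0.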


EXAMPLE 10.6 Unconditional Stability of the Crank-Nicolson Scheme 


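One last time we turn to the initial boundary value problem

$$\frac{\partial u}{\partial t} = \frac{1}{16}\frac{\partial^2 u}{\partial x^2}, \quad u(0,t) = u(1,t) = 0, \quad u(x,0) = 2\sin(2\pi x).$$

Figure 10.6 displays approximate solutions to this problem computed using the
Crank-Nicolson scheme with $\Delta x = 1/40$ and $\Delta t = 1/200$ (top graph), $\Delta t = 1/100$
(middle graph), and $\Delta t = 1/10$ (bottom graph). The corresponding values of
$D\,\Delta t/(\Delta x)^2$ are 1/2, 1, and 10, so the Crank-Nicolson parameter
$\lambda = D\,\Delta t/[2(\Delta x)^2]$ takes the values 1/4, 1/2, and 5. Even with the largest
of these time steps, the computed solution shows no signs of instability.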


von Neumann Stability Analysis 


Von Neumann stability analysis is very similar to the local mode analysis we used
in Chapter 9. We assume that the solution to the finite difference equations can be
expressed in terms of the Fourier components

$$w_j^{(n)} = r^n e^{i(j\theta)}. \qquad (5)$$

The amplitude coefficient, $r$, is called the amplification factor. As $n$ increases (more
time steps are computed), we want the amplitude of this mode of the solution to
be bounded. In other words, we require that the amplification factor satisfy

$$-1 \le r \le 1$$

for a real-valued amplitude, or

$$r\bar{r} \le 1$$

for a complex-valued amplitude, where $\bar{r}$ denotes the complex conjugate of $r$. The
equation for the amplification factor is obtained by substituting the Fourier component,
(5), into the finite difference equation. This generally results in a polynomial
in $r$.

[Figure 10.6: Approximate solutions to the initial-boundary value problem $\partial u/\partial t = (1/16)\,\partial^2 u/\partial x^2$, $u(0,t) = u(1,t) = 0$, $u(x,0) = 2\sin(2\pi x)$, computed using the Crank-Nicolson scheme. Solutions at $t = 1$ are shown.]
To demonstrate von Neumann stability analysis, let’s consider the FTCS
method. The generic interior point finite difference equation for this method is

$$w_j^{(n+1)} - w_j^{(n)} = \lambda\left(w_{j-1}^{(n)} - 2w_j^{(n)} + w_{j+1}^{(n)}\right),$$

where $\lambda = D\,\Delta t/(\Delta x)^2$. Substitution of $w_j^{(n)} = r^n e^{i(j\theta)}$ into the difference equation
yields

$$\left(r^{n+1} - r^n\right)e^{i(j\theta)} = \lambda r^n e^{i(j\theta)}\left(e^{-i\theta} - 2 + e^{i\theta}\right),$$

or, upon simplification,

$$r - 1 = \lambda\left(e^{-i\theta} - 2 + e^{i\theta}\right).$$

Making use of the identity $\cos\theta = (e^{-i\theta} + e^{i\theta})/2$ leads to $r = 1 - 2\lambda(1 - \cos\theta)$. $r$
is clearly real-valued, so we require $-1 \le r \le 1$. Since $0 \le 1 - \cos\theta \le 2$, it follows
that $1 - 4\lambda \le 1 - 2\lambda(1 - \cos\theta) \le 1$. Thus, for stability, we must have $-1 \le 1 - 4\lambda$,
or $\lambda \le 1/2$. This is exactly the same result we obtained via matrix analysis. The
application of von Neumann analysis to the BTCS method and the Crank-Nicolson
scheme will be left as exercises.
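
The bound on the FTCS amplification factor can also be observed by sampling $\theta$. This is a rough numerical sketch only; the sampling resolution and the function name `max_amplification` are choices made for this illustration.

```python
import math

def max_amplification(lam, samples=2001):
    """Largest |r| over sampled theta in [-pi, pi], with r = 1 - 2*lam*(1 - cos(theta))."""
    worst = 0.0
    for i in range(samples):
        theta = -math.pi + 2.0 * math.pi * i / (samples - 1)
        r = 1.0 - 2.0 * lam * (1.0 - math.cos(theta))
        worst = max(worst, abs(r))
    return worst

results = {lam: max_amplification(lam) for lam in (0.25, 0.5, 0.55, 1.0)}
for lam, worst in results.items():
    verdict = "stable" if worst <= 1.0 + 1e-12 else "unstable"
    print(f"lambda = {lam}: max |r| = {worst:.4f} ({verdict})")
```

The largest magnitude is attained at $\theta = \pm\pi$, where $r = 1 - 4\lambda$. This is the highest-frequency mode that the grid can represent, which is consistent with the sawtooth character of the instability seen in Figure 10.4.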


EXERCISES 


1. Reconsider the initial boundary value problem

$$\frac{\partial u}{\partial t} = \frac{1}{16}\frac{\partial^2 u}{\partial x^2}, \quad u(0,t) = u(1,t) = 0, \quad u(x,0) = 2\sin(2\pi x).$$

Using the FTCS method with $\Delta x = 1/40$ and the indicated value for $\Delta t$, determine
the time $t$ when instability becomes apparent in the approximate solution.
(a) $\Delta t = 1/185$

(b) $\Delta t = 1/190$

(c) $\Delta t = 1/195$


2. Use von Neumann stability analysis to show that the BTCS method is unconditionally
stable. The generic interior grid point finite difference equation for the
BTCS method is

$$w_j^{(n+1)} - w_j^{(n)} = \lambda\left(w_{j-1}^{(n+1)} - 2w_j^{(n+1)} + w_{j+1}^{(n+1)}\right),$$

where $\lambda = D\,\Delta t/(\Delta x)^2$.

3. Use von Neumann stability analysis to show that the Crank-Nicolson scheme is
unconditionally stable. The generic interior grid point finite difference equation
for the Crank-Nicolson scheme is

$$w_j^{(n+1)} - w_j^{(n)} = \lambda\left(w_{j-1}^{(n+1)} - 2w_j^{(n+1)} + w_{j+1}^{(n+1)} + w_{j-1}^{(n)} - 2w_j^{(n)} + w_{j+1}^{(n)}\right),$$

where $\lambda = D\,\Delta t/[2(\Delta x)^2]$.

4. (a) Show that if we apply Euler’s method to the system of semidiscrete equations
given in equation (3) of Section 10.1 we reproduce the FTCS method.
(b) Use the region of absolute stability for Euler’s method to show that the
FTCS method is conditionally stable and requires $\lambda \le \frac{1}{2}$.

5. (a) Show that if we apply the backward Euler method to the system of semidiscrete
equations given in equation (3) of Section 10.1 we reproduce the BTCS
method.

(b) Use the region of absolute stability for the backward Euler method to show
that the BTCS method is unconditionally stable.

6. (a) Show that if we apply the trapezoidal method to the system of semidiscrete
equations given in equation (3) of Section 10.1 we reproduce the Crank-Nicolson
scheme.

(b) Use the region of absolute stability for the trapezoidal method to show that
the Crank-Nicolson scheme is unconditionally stable.

7. (a) Suppose we apply the classical fourth-order Runge-Kutta method to the
system of semidiscrete equations given in equation (3) of Section 10.1. What
restriction must be placed upon $\Delta t$ to guarantee absolute stability?

(b) Repeat part (a) for the two-step Adams-Moulton method.

Ce ens 

oft girl) A(w °), — ay"? + wih), 
where $\lambda = 2D\,\Delta t/(\Delta x)^2$. Determine the stability of this scheme.

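9. The generic interior grid point finite difference equation for the DuFort-Frankel
method is

$$w_j^{(n+1)} - w_j^{(n-1)} = \lambda\left(w_{j-1}^{(n)} - w_j^{(n+1)} - w_j^{(n-1)} + w_{j+1}^{(n)}\right),$$

where $\lambda = 2D\,\Delta t/(\Delta x)^2$. Determine the stability of this scheme.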


10. The FTCS method, the BTCS method, and the Crank-Nicolson scheme are special
cases of the $\sigma$-general finite difference scheme. The $\sigma$-general scheme can be
written in the form

In arriving at this equation, homogeneous Dirichlet boundary conditions have
been assumed. The $(N - 1) \times (N - 1)$ matrices $A_1$ and $A_2$ are of the form


Ay = (t+ 2eAjl—aAaNX 
Ag =| ~Mt—w#)as +{l—a@)AX, 


where A = 2A¢/(Az}* and 


(a) Show that $\sigma = 0$ corresponds to the FTCS method, $\sigma = 1$ corresponds to
the BTCS method and $\sigma = 1/2$ corresponds to the Crank-Nicolson scheme.

(b) Show that the matrix $A_1$ is nonsingular.

(c) Show that the $\sigma$-general scheme is conditionally stable for $\sigma < 1/2$ and
unconditionally stable for $\sigma \ge 1/2$.

(d) For $\sigma < 1/2$, show that the method is stable for


re 
11. Consider the FTCS method and recall that the eigenvalues of $E_{FTCS}$ are

$$\mu_k = 1 - 4\lambda\sin^2\frac{k\pi}{2N}$$

for $k = 1, 2, 3, \ldots, N-1$.
(a) Show that the vector $\mathbf{v}_k$ whose components are given by

$$(\mathbf{v}_k)_j = \sin\frac{jk\pi}{N}$$

for $j = 1, 2, 3, \ldots, N-1$ is an eigenvector associated with $\mu_k$.
(b) For $N = 10$, plot the components of $\mathbf{v}_9$ versus $j$. Repeat for $N = 40$ and
$\mathbf{v}_{39}$.
(c) Using part (b) and the fact that $\mu_{N-1}$ is the most negative eigenvalue of
$E_{FTCS}$, explain why instability manifests itself in the form of sawtooth
oscillations.


10.3 MORE GENERAL PARABOLIC EQUATIONS 


The previous two sections presented a detailed examination of three different finite 
difference methods for approximating the solution of the heat equation 


$$\frac{\partial u}{\partial t} = D\frac{\partial^2 u}{\partial x^2} \qquad (1)$$

with Dirichlet boundary conditions. The forward in time/central in space, or FTCS,
method was found to be first-order accurate in the time discretization parameter,
$\Delta t$, second-order accurate in the space discretization parameter, $\Delta x$, and
conditionally stable, requiring $\lambda = D\,\Delta t/(\Delta x)^2 \le 1/2$. The backward in time/central
in space, or BTCS, method was also found to be first order in $\Delta t$ and second
order in $\Delta x$ but was unconditionally stable. Finally, the Crank-Nicolson scheme was
found to be second order in both $\Delta t$ and $\Delta x$ and to be unconditionally stable.

In this section we will address more general parabolic partial differential equations
than equation (1). In particular, we will treat the equations

$$\frac{\partial u}{\partial t} = D\frac{\partial^2 u}{\partial x^2} + s(x,t)$$

and

$$\frac{\partial u}{\partial t} = D\frac{\partial^2 u}{\partial x^2} - \beta(x,t)u + s(x,t),$$

where $s(x,t)$ is a source term and $-\beta(x,t)u$ is a decay term. Note that we will
continue to assume that the diffusion coefficient is constant. For each partial differential
equation, we will develop the finite difference equations for the FTCS method,
the BTCS method and the Crank-Nicolson scheme and will investigate any changes
in stability.
Source Terms


Let’s start generalizing the heat equation by including a source term only and
continuing to specify Dirichlet boundary conditions. In other words, let’s develop
finite difference methods for approximating the solution of the initial boundary
value problem

$$\text{IBVP} \quad \begin{cases} \dfrac{\partial u}{\partial t} = D\dfrac{\partial^2 u}{\partial x^2} + s(x,t), & A \le x \le B, \; t > 0 \\ u(A,t) = u_A(t) \\ u(B,t) = u_B(t) \\ u(x,0) = f(x). \end{cases} \qquad (2)$$


In practice, a source term might represent the effect of precipitation in a groundwa- 
ter flow problem, the effect of internal heat generation (possibly due to the passage 
of an electric current) in a heat conduction problem, or the creation of particles in 
a molecular diffusion problem. 

To derive the finite difference approximation to (2), first introduce a uniform
partition over the interval $A \le x \le B$, with mesh spacing $\Delta x = (B - A)/N$. Next,
replace the second derivative term, $\partial^2 u/\partial x^2$, by its second-order finite difference
formula. This produces the semidiscrete approximation

$$\frac{d\mathbf{v}(t)}{dt} = -\frac{D}{(\Delta x)^2}\left[A\mathbf{v}(t) + \mathbf{b}(t)\right] + \mathbf{s}(t), \quad \mathbf{v}(0) = \mathbf{f}. \qquad (3)$$

Here

$$\mathbf{v}(t) = [\, v_1(t) \;\; v_2(t) \;\; v_3(t) \;\; \cdots \;\; v_{N-1}(t) \,]^T,$$
$$\mathbf{b}(t) = [\, -u_A(t) \;\; 0 \;\; \cdots \;\; 0 \;\; -u_B(t) \,]^T,$$
$$\mathbf{s}(t) = [\, s(x_1,t) \;\; s(x_2,t) \;\; s(x_3,t) \;\; \cdots \;\; s(x_{N-1},t) \,]^T,$$
$$\mathbf{f} = [\, f(x_1) \;\; f(x_2) \;\; f(x_3) \;\; \cdots \;\; f(x_{N-1}) \,]^T$$

and $v_j(t) \approx u(x_j,t)$. The $(N - 1) \times (N - 1)$ matrix $A$ takes the form

$$A = \begin{bmatrix} 2 & -1 \\ -1 & 2 & -1 \\ & -1 & 2 & -1 \\ & & \ddots & \ddots & \ddots \\ & & & -1 & 2 & -1 \\ & & & & -1 & 2 \end{bmatrix}.$$

Finally, we obtain the FTCS method by evaluating (3) at $t = t_n$ and replacing
the time derivative by its first-order forward difference approximation. This
procedure yields

$$\mathbf{w}^{(n+1)} = (I - \lambda A)\mathbf{w}^{(n)} - \lambda\mathbf{b}^{(n)} + \Delta t\,\mathbf{s}^{(n)}.$$

The BTCS method,

$$(I + \lambda A)\mathbf{w}^{(n+1)} = \mathbf{w}^{(n)} - \lambda\mathbf{b}^{(n+1)} + \Delta t\,\mathbf{s}^{(n+1)},$$

and the Crank-Nicolson scheme,

$$(I + \lambda A)\mathbf{w}^{(n+1)} = (I - \lambda A)\mathbf{w}^{(n)} - \lambda\left(\mathbf{b}^{(n)} + \mathbf{b}^{(n+1)}\right) + \frac{\Delta t}{2}\left(\mathbf{s}^{(n)} + \mathbf{s}^{(n+1)}\right),$$

are obtained by evaluating (3) at $t = t_{n+1}$ and replacing the time derivative by its
first-order backward difference approximation and by integrating (3) from $t = t_n$ to
$t = t_{n+1}$ and using the trapezoidal rule to estimate the integral on the right-hand
side, respectively. The vector

$$\mathbf{w}^{(n)} = [\, w_1^{(n)} \;\; w_2^{(n)} \;\; \cdots \;\; w_{N-1}^{(n)} \,]^T$$

is the fully discrete approximation to the solution of (2); that is, $w_j^{(n)} \approx v_j(t_n) \approx
u(x_j,t_n)$, $\mathbf{b}^{(n)} = \mathbf{b}(t_n)$, and $\mathbf{s}^{(n)} = \mathbf{s}(t_n)$. For the FTCS and the BTCS methods,
$\lambda = D\,\Delta t/(\Delta x)^2$, while for the Crank-Nicolson scheme, $\lambda = D\,\Delta t/[2(\Delta x)^2]$. The
most important observation is that the inclusion of the source term in the partial
differential equation has absolutely no effect upon the evolution matrix of the
numerical method. Hence, there is no change to our previously established stability
results. The FTCS method still requires $\lambda \le 1/2$, and the BTCS method and the
Crank-Nicolson scheme are still unconditionally stable.
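
As an illustration of the FTCS update with a source term, the sketch below marches the test problem used in Example 10.7 ($D = 1/16$, $s(x,t) = -t + 8x^2$, $u(0,t) = 0$, $u(1,t) = 8t$, $u(x,0) = 2\sin(2\pi x)$) to $t = 10$. Several choices here are assumptions made for the sketch rather than part of the text: the spatial interval is fixed as $[0,1]$, the boundary values are stored in the solution array (which is equivalent to carrying the $-\lambda\mathbf{b}^{(n)}$ term), and the function name `ftcs_with_source` is hypothetical.

```python
import math

def ftcs_with_source(D, s, u_left, u_right, f, N, dt, n_steps):
    """FTCS on [0, 1]: w_new = (I - lam*A) w - lam*b + dt*s, all evaluated at t_n."""
    dx = 1.0 / N
    lam = D * dt / dx ** 2
    x = [j * dx for j in range(N + 1)]
    w = [f(xj) for xj in x]
    for n in range(n_steps):
        t = n * dt
        w[0], w[N] = u_left(t), u_right(t)     # Dirichlet data at time t_n
        new = w[:]
        for j in range(1, N):
            new[j] = (w[j] + lam * (w[j - 1] - 2.0 * w[j] + w[j + 1])
                      + dt * s(x[j], t))
        w = new
    t_end = n_steps * dt
    w[0], w[N] = u_left(t_end), u_right(t_end)
    return x, w

def exact(x, t):
    return 2.0 * math.exp(-(math.pi ** 2 / 4.0) * t) * math.sin(2.0 * math.pi * x) + 8.0 * x * x * t

def max_error(dt, n_steps):
    x, w = ftcs_with_source(
        D=1.0 / 16.0,
        s=lambda xj, t: -t + 8.0 * xj * xj,
        u_left=lambda t: 0.0,
        u_right=lambda t: 8.0 * t,
        f=lambda xj: 2.0 * math.sin(2.0 * math.pi * xj),
        N=20, dt=dt, n_steps=n_steps,
    )
    t_end = n_steps * dt
    return max(abs(wj - exact(xj, t_end)) for xj, wj in zip(x, w))

stable_error = max_error(1.0 / 50.0, 500)     # lambda = 1/2
unstable_error = max_error(2.0 / 95.0, 475)   # lambda = 10/19, just above 1/2
print(stable_error, unstable_error)
```

The source term enters only through the added vector $\Delta t\,\mathbf{s}^{(n)}$, so the stability limit is the same as for the homogeneous heat equation: the run with $\lambda = 1/2$ stays close to the exact solution, while the run with $\lambda = 10/19$ is contaminated by amplified rounding errors.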


EXAMPLE 10.7 A Test Problem with a Source Term 


Let’s consider the initial boundary value problem

$$\frac{\partial u}{\partial t} = \frac{1}{16}\frac{\partial^2 u}{\partial x^2} - t + 8x^2, \quad u(0,t) = 0, \quad u(1,t) = 8t, \quad u(x,0) = 2\sin(2\pi x),$$

whose exact solution is $u(x,t) = 2e^{-(\pi^2/4)t}\sin(2\pi x) + 8x^2 t$. The source term for
this problem is $s(x,t) = -t + 8x^2$, so the vector $\mathbf{s}^{(n)}$ is given by

$$\mathbf{s}^{(n)} = [\, -t_n + 8(\Delta x)^2 \;\;\; -t_n + 8(2\Delta x)^2 \;\;\; -t_n + 8(3\Delta x)^2 \;\;\; \cdots \;\;\; -t_n + 8(1 - \Delta x)^2 \,]^T.$$

The function $u(x,10)$ is plotted in the upper left panel of Figure 10.7. The
other panels of Figure 10.7 display the approximate solutions computed using our
three finite difference methods. For each method, $\Delta x = 1/20$ was used. The upper
right panel contains two approximate solutions computed using the FTCS method.
With $\Delta x = 1/20$, stability requires

$$\Delta t \le \frac{(1/20)^2}{2(1/16)} = \frac{1}{50}.$$
[Figure 10.7: Exact solution to the initial boundary value problem $\partial u/\partial t = (1/16)\,\partial^2 u/\partial x^2 - t + 8x^2$, $u(0,t) = 0$, $u(1,t) = 8t$, $u(x,0) = 2\sin(2\pi x)$, and approximate solutions computed using the FTCS method, the BTCS method, and the Crank-Nicolson scheme. All solutions are displayed at $t = 10$. Panels: Exact solution (upper left); FTCS method (upper right); BTCS method, $\Delta t = 1/5$ ($\lambda = 5$) (lower left); Crank-Nicolson, $\Delta t = 1/5$ ($\lambda = 5$) (lower right). Horizontal axis: $x$.]


The solid curve corresponds to $\Delta t = 1/50$. The dashed curve was computed with
$\Delta t = 2/95$, which corresponds to

$$ \lambda = \frac{(1/16)(2/95)}{(1/20)^2} = \frac{10}{19} \approx 0.526. $$

The instability in this solution is clear. The approximate solutions computed using
the BTCS method and the Crank-Nicolson scheme are shown in the lower left and
lower right panels, respectively. Each of these curves was obtained with $\Delta t = 1/5$
($\lambda = 5$), and neither displays any signs of instability.
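The FTCS experiment in this example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`ftcs_source_example`, `exact_solution`, `max_error`) are hypothetical, only the standard library is used, and the Dirichlet data are assumed to enter the update at time level $n$, as in the FTCS formula with $\mathbf{b}^{(n)}$.

```python
import math

def ftcs_source_example(dt, t_final=10.0, n_cells=20, D=1.0 / 16.0):
    """FTCS for u_t = D u_xx - t + 8x^2, u(0,t)=0, u(1,t)=8t, u(x,0)=2sin(2 pi x)."""
    dx = 1.0 / n_cells
    lam = D * dt / dx**2
    x = [j * dx for j in range(n_cells + 1)]
    w = [2.0 * math.sin(2.0 * math.pi * xj) for xj in x]
    w[0], w[-1] = 0.0, 0.0                      # boundary values at t = 0
    for n in range(round(t_final / dt)):
        t = n * dt
        new = w[:]
        for j in range(1, n_cells):
            source = -t + 8.0 * x[j] ** 2       # s(x_j, t_n)
            new[j] = w[j] + lam * (w[j - 1] - 2.0 * w[j] + w[j + 1]) + dt * source
        new[0] = 0.0                            # u(0, t_{n+1})
        new[-1] = 8.0 * (t + dt)                # u(1, t_{n+1})
        w = new
    return x, w, lam

def exact_solution(x, t):
    return 2.0 * math.exp(-(math.pi**2 / 4.0) * t) * math.sin(2.0 * math.pi * x) + 8.0 * x**2 * t

def max_error(dt, t_final=10.0):
    x, w, lam = ftcs_source_example(dt, t_final)
    err = max(abs(wj - exact_solution(xj, t_final)) for xj, wj in zip(x, w))
    return lam, err

lam_stable, err_stable = max_error(1.0 / 50.0)      # lambda = 1/2
lam_unstable, err_unstable = max_error(2.0 / 95.0)  # lambda = 10/19
print(f"lambda = {lam_stable:.4f}: max error = {err_stable:.3e}")
print(f"lambda = {lam_unstable:.4f}: max error = {err_unstable:.3e}")
```

Under these assumptions the run with $\lambda = 1/2$ should track the exact solution closely, while the run with $\lambda \approx 0.526$ should show an error that is larger by many orders of magnitude, because roundoff in the highest-frequency mode is amplified at every step.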
Decay Terms


Next, let’s introduce a decay term and develop finite difference methods for the
initial boundary value problem

$$ \text{IBVP} \quad \begin{cases} \dfrac{\partial u}{\partial t} = D\dfrac{\partial^2 u}{\partial x^2} - \beta(x,t)\,u + s(x,t), & A < x < B,\ t > 0 \\ u(A,t) = u_A(t) \\ u(B,t) = u_B(t) \\ u(x,0) = f(x) \end{cases} \qquad (4) $$

where $\beta(x,t) \ge 0$ for all $x \in [A,B]$ and all $t \ge 0$. Like the source term, the
decay term can model a variety of different phenomena. For example, in a heat
conduction problem, the term $-\beta(x,t)u$ arises when there is convective heat loss
from a lateral surface, while in a molecular diffusion problem, the decay term models
the absorption of particles with a mean absorption rate of $\beta(x,t)$.

The semidiscrete approximation associated with problem (4) is

$$ \frac{d\mathbf{v}(t)}{dt} = -\frac{D}{(\Delta x)^2}\left[A\mathbf{v}(t) + \mathbf{b}(t)\right] - B(t)\mathbf{v}(t) + \mathbf{s}(t), \qquad \mathbf{v}(0) = \mathbf{f}, $$

where the vectors $\mathbf{v}(t)$, $\mathbf{b}(t)$, $\mathbf{s}(t)$, and $\mathbf{f}$, and the matrix $A$, are the same as given
above. $B$ is the diagonal matrix

$$ B = \operatorname{diag}\left( \beta(x_1,t),\ \beta(x_2,t),\ \beta(x_3,t),\ \ldots,\ \beta(x_{N-1},t) \right). $$


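Working from the semidiscrete approximation and following the basic procedures
described earlier, we obtain the FTCS method

$$ \mathbf{w}^{(n+1)} = \left(I - \lambda A - \Delta t\,B^{(n)}\right)\mathbf{w}^{(n)} - \lambda\mathbf{b}^{(n)} + \Delta t\,\mathbf{s}^{(n)}, $$

the BTCS method

$$ \left(I + \lambda A + \Delta t\,B^{(n+1)}\right)\mathbf{w}^{(n+1)} = \mathbf{w}^{(n)} - \lambda\mathbf{b}^{(n+1)} + \Delta t\,\mathbf{s}^{(n+1)}, $$

and the Crank-Nicolson scheme

$$ \left(I + \lambda A + \frac{\Delta t}{2}B^{(n+1)}\right)\mathbf{w}^{(n+1)} = \left(I - \lambda A - \frac{\Delta t}{2}B^{(n)}\right)\mathbf{w}^{(n)} - \lambda\left(\mathbf{b}^{(n)} + \mathbf{b}^{(n+1)}\right) + \frac{\Delta t}{2}\left(\mathbf{s}^{(n)} + \mathbf{s}^{(n+1)}\right). $$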

The presence of the decay term in the partial differential equation has resulted
in a change to the evolution matrix of each method. To determine whether these
changes have an effect upon stability, we will have to investigate further. Unfortunately,
$\beta(x,t)$ is generally not constant, so the evolution matrices no longer match
the special form given in Section 10.2. As a result, we will turn to von Neumann
stability analysis.

The generic finite difference equation for the FTCS method is

$$ w_j^{(n+1)} = \lambda w_{j-1}^{(n)} + \left(1 - 2\lambda - \Delta t\,\beta_j^{(n)}\right) w_j^{(n)} + \lambda w_{j+1}^{(n)} + \Delta t\,s_j^{(n)}. $$

The term $\Delta t\,s_j^{(n)}$ is independent of $w$, so it will have no effect on the amplification
factor for the discrete Fourier mode. We will therefore set this term to zero when
we substitute $w_j^{(n)} = r^n e^{ij\theta}$. Upon substituting the Fourier mode and simplifying,
we find

$$ r = \left(1 - 2\lambda - \Delta t\,\beta_j^{(n)}\right) + 2\lambda\cos\theta = 1 - \Delta t\,\beta_j^{(n)} - 2\lambda(1 - \cos\theta). $$

Note that $r$ never exceeds $1 - \Delta t\,\beta_j^{(n)}$, which for $\beta_j^{(n)} \ge 0$ is at most 1.
As a function of $\theta$, the smallest value taken on by $r$ is $1 - \Delta t\,\beta_j^{(n)} - 4\lambda$. Hence,
we need $1 - \Delta t\,\beta_j^{(n)} - 4\lambda \ge -1$ for stability. Recalling that $\lambda = D\,\Delta t/(\Delta x)^2$, after
some algebraic manipulation, we arrive at the inequality

$$ \Delta t \le \frac{(\Delta x)^2}{2D + (\Delta x)^2 \beta_j^{(n)}/2}. $$

Since we desire stability for all time steps and at all locations of the grid, the final
stability condition is

$$ \Delta t \le \frac{(\Delta x)^2}{2D + (\Delta x)^2 \left(\max_{1 \le j \le N-1,\ n \ge 0} \beta_j^{(n)}\right)/2}. $$

Hence, the FTCS method is still conditionally stable, and a slightly smaller value for $\Delta t$ is
required to maintain stability.
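As a quick numerical check of this condition, the sketch below evaluates the bound and samples the amplification factor $r(\theta)$ on $[0, \pi]$. The helper names (`ftcs_max_dt`, `ftcs_factor`, `worst_ftcs_factor`) are hypothetical, and the values $D = 1/16$, $\Delta x = 1/20$, and $\max\beta = 1$ are sample inputs chosen for illustration.

```python
import math

def ftcs_max_dt(D, dx, beta_max):
    """Largest time step allowed by the FTCS condition with a decay term."""
    return dx**2 / (2.0 * D + dx**2 * beta_max / 2.0)

def ftcs_factor(lam, dt, beta, theta):
    """Amplification factor r = 1 - dt*beta - 2*lam*(1 - cos(theta))."""
    return 1.0 - dt * beta - 2.0 * lam * (1.0 - math.cos(theta))

def worst_ftcs_factor(D, dx, dt, beta, samples=200):
    """Largest |r| over sampled theta in [0, pi]."""
    lam = D * dt / dx**2
    return max(abs(ftcs_factor(lam, dt, beta, k * math.pi / samples))
               for k in range(samples + 1))

D, dx, beta_max = 1.0 / 16.0, 1.0 / 20.0, 1.0   # sample values (assumed)
dt_max = ftcs_max_dt(D, dx, beta_max)
worst_at_bound = worst_ftcs_factor(D, dx, dt_max, beta_max)
worst_beyond = worst_ftcs_factor(D, dx, 10.0 / 499.0, beta_max)
print(f"dt_max = {dt_max:.6f}")
print(f"max |r| at dt_max   = {worst_at_bound:.6f}")
print(f"max |r| at 10/499   = {worst_beyond:.6f}")
```

For these sample values the bound evaluates to $2/101$; at that step size the sampled $|r|$ should not exceed 1 (up to roundoff), while a slightly larger step such as $10/499$ pushes $|r|$ above 1 at $\theta = \pi$.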
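For the BTCS method, we substitute $w_j^{(n)} = r^n e^{ij\theta}$ into

$$ -\lambda w_{j-1}^{(n+1)} + \left(1 + 2\lambda + \Delta t\,\beta_j^{(n+1)}\right) w_j^{(n+1)} - \lambda w_{j+1}^{(n+1)} = w_j^{(n)}. $$

The term $\Delta t\,s_j^{(n+1)}$ has once again been dropped since it will not affect the amplification
factor. Following simplification, we find

$$ r = \left[\left(1 + 2\lambda + \Delta t\,\beta_j^{(n+1)}\right) - 2\lambda\cos\theta\right]^{-1} = \left[1 + \Delta t\,\beta_j^{(n+1)} + 2\lambda(1 - \cos\theta)\right]^{-1}. $$

It is clear that $1 + \Delta t\,\beta_j^{(n+1)} + 2\lambda(1 - \cos\theta) \ge 1$ for all $\theta$ and all $\lambda$; therefore,
$0 < r = \left[1 + \Delta t\,\beta_j^{(n+1)} + 2\lambda(1 - \cos\theta)\right]^{-1} \le 1$, and the method is unconditionally
stable. A similar result holds for the Crank-Nicolson scheme. The details are left
as an exercise.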
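The unconditional bound on the BTCS amplification factor can be spot-checked numerically. The sketch below is illustrative only: the function names are hypothetical, and the grid of $\Delta t$ and $\beta$ values is an arbitrary sample, not taken from the text.

```python
import math

def btcs_factor(lam, dt, beta, theta):
    """Amplification factor r = 1 / (1 + dt*beta + 2*lam*(1 - cos(theta)))."""
    return 1.0 / (1.0 + dt * beta + 2.0 * lam * (1.0 - math.cos(theta)))

def btcs_factor_range(D, dx, dt, beta, samples=200):
    """Smallest and largest r over sampled theta in [0, pi]."""
    lam = D * dt / dx**2
    values = [btcs_factor(lam, dt, beta, k * math.pi / samples)
              for k in range(samples + 1)]
    return min(values), max(values)

D, dx = 1.0 / 16.0, 1.0 / 20.0                   # sample values (assumed)
ranges = []
for dt in (0.01, 0.2, 5.0, 100.0):               # small through very large steps
    for beta in (0.0, 1.0, 10.0):
        lo, hi = btcs_factor_range(D, dx, dt, beta)
        ranges.append((lo, hi))
        print(f"dt = {dt:7.2f}, beta = {beta:5.1f}: {lo:.3e} <= r <= {hi:.6f}")
```

Every sampled factor should lie in $(0, 1]$ regardless of how large $\Delta t$ is, which is the numerical counterpart of the inequality above.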
EXAMPLE 10.8 A Test Problem with a Decay Term


Let’s consider the initial boundary value problem

$$ \frac{\partial u}{\partial t} = \frac{1}{16}\frac{\partial^2 u}{\partial x^2} - 4x(1 - x)u + e^{-t}(8x^2 - t), $$

$$ u(0,t) = 0, \quad u(1,t) = 8te^{-t}, \quad u(x,0) = 2\sin(2\pi x). $$

The source term for this problem is $s(x,t) = e^{-t}(8x^2 - t)$, and the decay coefficient
is $\beta(x,t) = 4x(1 - x)$. Note that

$$ \max_{0 \le x \le 1,\ t \ge 0} \beta(x,t) = 1. $$

Figure 10.8 displays the approximate solution to this problem computed using
our three finite difference methods. For each method, $\Delta x = 1/20$ was used. The
upper left panel contains two approximate solutions computed using the FTCS
method. With $\Delta x = 1/20$, stability requires

$$ \Delta t \le \frac{(1/20)^2}{2(1/16) + (1/20)^2(1)/2} = \frac{2}{101} \approx 0.0198. $$


The solid curve corresponds to $\Delta t = 2/101$, while the dashed curve was computed
with $\Delta t = 10/499$. The instability in the latter solution is clear. Note the difference
six time steps make. The approximate solutions computed using the BTCS method
and the Crank-Nicolson scheme are shown in the upper right and lower left panels,
respectively. Each of these curves was obtained with $\Delta t = 1/5$, and neither displays
any signs of instability.
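A minimal sketch of the BTCS computation for this test problem follows. It is not the author's code: the function names (`thomas`, `btcs_decay_example`) are hypothetical, the tridiagonal system is solved with the standard Thomas algorithm, and the boundary values are assumed to be imposed at the new time level, matching the $\mathbf{b}^{(n+1)}$ term of the BTCS formula.

```python
import math

def thomas(lower, diag, upper, rhs):
    """Solve a tridiagonal system; lower[0] and upper[-1] are not used."""
    n = len(diag)
    c, d = [0.0] * n, [0.0] * n
    c[0], d[0] = upper[0] / diag[0], rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x

def btcs_decay_example(dt, t_final=10.0, n_cells=20, D=1.0 / 16.0):
    """BTCS for u_t = D u_xx - 4x(1-x)u + exp(-t)(8x^2 - t) with Dirichlet data."""
    dx = 1.0 / n_cells
    lam = D * dt / dx**2
    x = [j * dx for j in range(n_cells + 1)]
    beta = [4.0 * xj * (1.0 - xj) for xj in x]
    w = [2.0 * math.sin(2.0 * math.pi * xj) for xj in x]
    w[0] = w[-1] = 0.0
    m = n_cells - 1                               # number of interior unknowns
    for n in range(round(t_final / dt)):
        t_new = (n + 1) * dt
        left, right = 0.0, 8.0 * t_new * math.exp(-t_new)
        diag = [1.0 + 2.0 * lam + dt * beta[j] for j in range(1, n_cells)]
        rhs = [w[j] + dt * math.exp(-t_new) * (8.0 * x[j] ** 2 - t_new)
               for j in range(1, n_cells)]
        rhs[0] += lam * left                      # known boundary values move
        rhs[-1] += lam * right                    # to the right-hand side
        w = [left] + thomas([-lam] * m, diag, [-lam] * m, rhs) + [right]
    return x, w, lam

x_btcs, w_btcs, lam_btcs = btcs_decay_example(dt=1.0 / 5.0)
print(f"lambda = {lam_btcs:.3f}, max |w| at t = 10: {max(abs(v) for v in w_btcs):.3e}")
```

With $\Delta t = 1/5$ the value of $\lambda$ is 5, ten times the FTCS limit, yet the computed profile should remain bounded, consistent with the unconditional stability of the BTCS method.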


Application Problem 1: Including the Effect of Rainfall


Let’s reconsider the application problem from Section 10.1, “Rise in the Water
Table Due to the Spring Thaw.” Suppose that during the spring thaw, the average
daily rainfall is given by the function $w(t)$. To incorporate the effect of this rainfall
into our model, we need to add the term $w(t)\,\Delta x\,\Delta t$ to equation (18) from the
Chapter 1 Overview. Carrying through with the remainder of the derivation, we
find that our revised model is

$$ \frac{\partial h}{\partial t} = \alpha\frac{\partial^2 h}{\partial x^2} + \frac{w(t)}{S}, $$

where $S$ is the storativity of the soil.

As in Section 10.1, we take $\alpha = 509.76$ m$^2$/day, together with the initial condition
and the piecewise-defined boundary condition for $h(0,t)$ used there, and $h(800,t) = 0.3\,h(0,t)$.

Figure 10.8 Approximate solutions to the initial boundary value problem

$$ \frac{\partial u}{\partial t} = \frac{1}{16}\frac{\partial^2 u}{\partial x^2} - 4x(1 - x)u + e^{-t}(8x^2 - t), $$

$$ u(0,t) = 0, \quad u(1,t) = 8te^{-t}, \quad u(x,0) = 2\sin(2\pi x), $$

computed using the FTCS method, the BTCS method, and the Crank-Nicolson
scheme. All solutions are displayed at $t = 10$.


Further, suppose $S = 0.2$ and

$$ w(t) = \begin{cases} 0.004 \text{ m/day}, & t \le 30 \\ 0, & t > 30. \end{cases} $$

Figure 10.9 displays the resulting change in the water table along the aquifer after
30 days and after 60 days. Note how the water level diffuses once the rainfall has
ended. Both solution profiles were computed using the Crank-Nicolson scheme with
$\Delta x = 20$ meters and $\Delta t = 0.2$ days.


Figure 10.9 Change in the water table in an aquifer during the spring
thaw, including the effect of average daily rainfall during the first thirty
days.

Application Problem 2: One-Dimensional Model for Color Photograph Development

When photographic film is exposed to light, chemical reactions cause the silver
halide grains in the film to acquire latent image sites. During the development
of this image, a chemical in the developer solution known as a reduced developer
deposits electrons at the latent image sites and becomes an oxidized developer.
The oxidized developer diffuses and reacts with a dye-forming coupler to form an
immobile dye and an inhibitor. The inhibitor then diffuses, and some of it adsorbs
to the surface of the silver grain, blocking the dissociation of the halides. The entire
process is extremely complex, and modeling is not yet fully understood.

Friedman and Littman (in Chapter 4 of Industrial Mathematics: A Course
in Solving Real-World Problems, SIAM, Philadelphia, 1994) present a simplified,
one-dimensional model of the development process by considering only the density
of the oxidized developer, $T(x,t)$, and the coupler, $C(x,t)$. The model consists of
an initial boundary value problem for the oxidized developer plus an initial value
problem for the coupler:

$$ \frac{\partial T}{\partial t} = D\frac{\partial^2 T}{\partial x^2} - kTC + \gamma E(x), \qquad \frac{\partial C}{\partial t} = -kTC. $$

The initial conditions for this system are $T(x,0) = 0$ and $C(x,0) = C_0(x)$, while
the boundary conditions are $T(0,t) = T(L,t) = 0$, where $L$ is the length of the film.
$E(x)$ is the exposure function, which indicates those regions of the film that were
exposed to light.

We will simulate three minutes (180 seconds) of development time, taking as 
parameter values 


D = 100 pm?/s,k = 6.6 x 10!2um/moles -s,y = 7.5 x 107 moles/jum - s 
L=15x108~m and Cy = 1.125 x 107" moles/m. 
To help balance the drastic differences in order of magnitude between the parameters,
let’s introduce the following nondimensional variables:


= ee £ 
C= =t d#=-—. 
t and £ Z 


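In terms of these variables, the system of differential equations becomes

$$ \frac{\partial \hat{T}}{\partial \hat{t}} = \frac{\partial^2 \hat{T}}{\partial \hat{x}^2} - k_1 \hat{T}\hat{C} + E(\hat{x}), \qquad \frac{\partial \hat{C}}{\partial \hat{t}} = -k_2 \hat{T}\hat{C}, $$

where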
CokL? kyL4 
ky = 20 = 1.670625 x 1029 and kp = ~~ = 2.505975 x 10%, 


D2 


Furthermore, note that $t = 180$ seconds translates to $\hat{t} = 8 \times 10^{-3}$. For the exposure
function, we will assume

$$ E(\hat{x}) = \begin{cases} 1, & 0.1 < \hat{x} < 0.2 \text{ or } 0.45 < \hat{x} < 0.55 \\ 0, & \text{elsewhere.} \end{cases} $$


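So how do we solve this system of differential equations? For each time step,
we will first compute $\hat{C}^{(n+1)}$ using a semi-implicit discretization of the coupler
equation. That is, on the right-hand side of the coupler equation, evaluate $\hat{T}$ at
time level $n$ and $\hat{C}$ at time level $n + 1$. This yields the finite difference equation

$$ \frac{\hat{C}_j^{(n+1)} - \hat{C}_j^{(n)}}{\Delta\hat{t}} = -k_2 \hat{T}_j^{(n)} \hat{C}_j^{(n+1)} \quad\Longrightarrow\quad \hat{C}_j^{(n+1)} = \frac{\hat{C}_j^{(n)}}{1 + \Delta\hat{t}\,k_2 \hat{T}_j^{(n)}}. $$

By using a semi-implicit discretization we avoid having to solve a nonlinear algebraic
equation to compute $\hat{T}^{(n+1)}$. Since the method for advancing the coupler density
is only first order in time, once $\hat{C}^{(n+1)}$ has been calculated, we will use the BTCS
method for the $\hat{T}$ equation:

$$ \frac{\hat{T}_j^{(n+1)} - \hat{T}_j^{(n)}}{\Delta\hat{t}} = \frac{\hat{T}_{j-1}^{(n+1)} - 2\hat{T}_j^{(n+1)} + \hat{T}_{j+1}^{(n+1)}}{(\Delta\hat{x})^2} - k_1 \hat{T}_j^{(n+1)} \hat{C}_j^{(n+1)} + E(\hat{x}_j). $$

Density profiles computed in this manner with $\Delta\hat{x} = 1/100$ and $\Delta\hat{t} = 2 \times 10^{-5}$
(i.e., 400 time steps) are shown in Figure 10.10.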
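The two-stage time step described above can be sketched as follows. This is an illustration, not the computation behind Figure 10.10: the function names are hypothetical, the values `k1 = 50.0` and `k2 = 200.0` are placeholders rather than the constants of this problem, and the scaled initial coupler density is assumed to be 1 everywhere (a constant $C_0$).

```python
def thomas(lower, diag, upper, rhs):
    """Solve a tridiagonal system; lower[0] and upper[-1] are not used."""
    n = len(diag)
    c, d = [0.0] * n, [0.0] * n
    c[0], d[0] = upper[0] / diag[0], rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x

def develop(k1, k2, n_cells=100, dt=2.0e-5, n_steps=400):
    """Semi-implicit coupler update followed by a BTCS step for the developer."""
    dx = 1.0 / n_cells
    lam = dt / dx**2                              # scaled diffusion coefficient is 1
    x = [j * dx for j in range(n_cells + 1)]
    E = [1.0 if (0.1 < xj < 0.2 or 0.45 < xj < 0.55) else 0.0 for xj in x]
    T = [0.0] * (n_cells + 1)                     # T(x, 0) = 0
    C = [1.0] * (n_cells + 1)                     # assumed scaled C(x, 0) = 1
    m = n_cells - 1
    for _ in range(n_steps):
        # Coupler: T at level n, C at level n+1, solved in closed form.
        C = [C[j] / (1.0 + dt * k2 * T[j]) for j in range(n_cells + 1)]
        # Developer: BTCS with the freshly computed C as a decay coefficient.
        diag = [1.0 + 2.0 * lam + dt * k1 * C[j] for j in range(1, n_cells)]
        rhs = [T[j] + dt * E[j] for j in range(1, n_cells)]
        T = [0.0] + thomas([-lam] * m, diag, [-lam] * m, rhs) + [0.0]
    return x, T, C

x_film, T_film, C_film = develop(k1=50.0, k2=200.0)   # placeholder constants
print(f"max T = {max(T_film):.3e}, min C = {min(C_film):.6f}")
```

Because the coupler update only divides by a quantity no smaller than 1 and the developer system is linear once $\hat{C}^{(n+1)}$ is known, each step costs one tridiagonal solve, which is the practical payoff of the semi-implicit choice.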


Figure 10.10 Oxidized developer and coupler density profiles after
three minutes of development time.


EXERCISES


In Exercises 1-3, numerically verify that

(a) the FTCS method is stable for $\lambda = 1/2$ but unstable for any $\lambda > 1/2$;
(b) the BTCS method is unconditionally stable; and
(c) the Crank-Nicolson method is unconditionally stable.


au ua 1. 
as = Fy t2t+e(1—2), u(0,t) = 1, u(t) =0, u(a,0) = 1-a—=sin(2nz) 
2 ae ay 3 +z u(0, t) =0, ulm, t) = nt, ula, 0) =sing 
3. a ze — (c+? Je” ’ u(0, t) S15 uln,t) =e", u(z, 0) =1+sinz 


In Exercises 4-6, numerically verify that
(a) the FTCS method is unstable for $\lambda = 1/2$, but stable for

$$ \Delta t \le \frac{(\Delta x)^2}{2D + (\Delta x)^2 \left(\max_{1 \le j \le N-1,\ n \ge 0} \beta_j^{(n)}\right)/2}; $$

(b) the BTCS method is unconditionally stable; and
(c) the Crank-Nicolson method is unconditionally stable.

c 1 
ae ot 2 zu, u(0,t)=1, u(l,t)=0, ul(z,0)=1l—-2- . sin(272) 
ou _ Ou _10 
at «Ox? 1 +2t 
bu Bu at 
* 6t — Ax? 

l+sinz 


——u, u(l0,t)=ulr,t}=0, ulx,0) =sing 


=e +t 26 a u(0, t) =], u(n, t) = e™, u(z, 0) = 
7. Consider the partial differential equation


$$ \frac{\partial u}{\partial t} = D\frac{\partial^2 u}{\partial x^2} - \beta(t)u + s(x,t), $$

where the coefficient on the decay term is a function of $t$ only.
(a) Let $B(t)$ be any antiderivative of $\beta(t)$, and define $w(x,t) = e^{B(t)}u(x,t)$.
Show that

$$ \frac{\partial w}{\partial t} = D\frac{\partial^2 w}{\partial x^2} + e^{B(t)}s(x,t). $$

(b) What advantage is there to applying the FTCS method to the equation for
$w$ rather than the equation for $u$?


In Exercises 8-10, solve the indicated initial boundary value problem by first applying
the technique of Exercise 7(a) to remove the decay term. Advance the solution to $t = 5$.


du Pu -t 1, 
8. er reese u(0,t)=e", u(i,t)=0, u(z,0)=1l—2— - sin(2rz) 
du Fu 2t at ; 
9. OL = Bx? = ae as zy u(0, t) oa 0, u(r, t) a 1 4 #2? u(x, 0) = sing 
2 
10. 4 a aes —2tu, u(0,t)=u(1,t)=0, u(x,0) = 2sin(2rx) 


11. A 15-cm-long piece of copper wire is initially at a uniform temperature of 20°C.
At $t = 0$, a 10 amp current begins flowing through the wire, generating heat, while
the temperature at the ends of the wire is maintained at 20°C. The temperature
within the wire satisfies the initial boundary value problem

$$ \rho c_p\frac{\partial T}{\partial t} = k\frac{\partial^2 T}{\partial x^2} + \frac{4I^2R}{\pi D^2}, \quad T(0,t) = T(0.15,t) = 20, \quad T(x,0) = 20. $$

The parameters in this problem are the mass density $\rho = 8933$ kg/m$^3$, the heat
capacity $c_p = 385$ J/kg·°C, the thermal conductivity $k = 401$ J/m·s·°C, the
diameter of the wire $D = 2.6$ mm, the current $I$, and the resistance per unit
length

$$ R = \frac{6.8 \times 10^{-8}\ \Omega\cdot\text{m}}{\pi D^2}. $$

Approximate the temperature profile along the length of the wire 30 seconds, 60
seconds, and 90 seconds after the current has begun to flow.


12. Rework the “Including the Effect of Rainfall” application problem with hydraulic
diffusivity $\alpha = 414.72$ m$^2$/day, storativity $S = 0.5$, initial and boundary conditions

$$ h(0,t) = h(800,t) = 0, \quad h(x,0) = 0, $$

and rainfall

$$ w(t) = \begin{cases} 0.01 \text{ m/day}, & t \le 20 \\ 0, & t > 20. \end{cases} $$

Determine the change in the water table in 10-day increments, up to $t = 60$ days.


13. Two chambers are connected by a hollow tube 0.4 meters long and contain pools
of ethyl alcohol that are maintained at 30°C. At $t = 0$, the valves at each end
of the tube are opened, and alcohol vapors diffuse into the tube. At 30°C, the
diffusion coefficient of the alcohol vapors is $D = 1.19 \times 10^{-5}$ m$^2$/s and 10%
alcohol vapor is present in the air. The tube contains a filter with a mean
absorption rate of $\mu = 0.0069$ s$^{-1}$. The percent of alcohol vapor within the
tube, $u(x,t)$, satisfies

$$ \frac{\partial u}{\partial t} = D\frac{\partial^2 u}{\partial x^2} - \mu u, \quad u(0,t) = u(0.4,t) = 10, \quad u(x,0) = 0. $$

Approximate $u(x,t)$ at $t = 100$ seconds, $t = 200$ seconds, and $t = 300$ seconds.

14. A circular commercial bronze rod of radius $r = 5$ cm and length 1 meter initially
has a uniform temperature distribution 50 K above ambient temperature. At $t = 0$
the temperature at the right end of the rod is lowered to ambient. Let $\theta$ denote
the difference between the temperature of the rod and ambient temperature. If
the surface of the rod is exposed to convective heat transfer, then $\theta$ satisfies the
initial boundary value problem

$$ \rho c_p\frac{\partial\theta}{\partial t} = k\frac{\partial^2\theta}{\partial x^2} - \frac{2h}{r}\theta, \quad \theta(0,t) = 50, \quad \theta(1,t) = 0, \quad \theta(x,0) = 50, $$

where $\rho = 8800$ kg/m$^3$ is the mass density, $c_p = 420$ J/kg·K is the heat capacity,
and $k = 52$ W/m·K is the thermal conductivity of the rod. $h = 25$ W/m$^2$·K is
the convective heat transfer coefficient. Determine the temperature profile along
the rod after 600 seconds, 1200 seconds, and 1800 seconds.

15. Rework the “One-Dimensional Model for Color Photograph Development”
application problem using the parameter values

D = 550 pm?/s, k= 2.8 x 10°? um/moles -s,
y= 3.2 x io? moles/pym : s,
L=15x10° pm and Cy =8.3 x 107° moles/um.

Use

$$ E(\hat{x}) = \begin{cases} 1, & 0.1 < \hat{x} < 0.2 \text{ or } 0.35 < \hat{x} < 0.50 \text{ or } 0.62 < \hat{x} < 0.69 \\ 0, & \text{elsewhere} \end{cases} $$

as the exposure function and simulate five minutes of development time.

16. Show that the Crank-Nicolson scheme

$$ \left(I + \lambda A + \frac{\Delta t}{2}B^{(n+1)}\right)\mathbf{w}^{(n+1)} = \left(I - \lambda A - \frac{\Delta t}{2}B^{(n)}\right)\mathbf{w}^{(n)} - \lambda\left(\mathbf{b}^{(n)} + \mathbf{b}^{(n+1)}\right) + \frac{\Delta t}{2}\left(\mathbf{s}^{(n)} + \mathbf{s}^{(n+1)}\right) $$

for approximating the solution of

$$ \text{IBVP} \quad \begin{cases} \dfrac{\partial u}{\partial t} = D\dfrac{\partial^2 u}{\partial x^2} - \beta(x,t)u + s(x,t), & A < x < B,\ t > 0 \\ u(A,t) = u_A(t) \\ u(B,t) = u_B(t) \\ u(x,0) = f(x) \end{cases} $$

is unconditionally stable.
10.4 NON-DIRICHLET BOUNDARY CONDITIONS


Consider the parabolic partial differential equation

$$ \frac{\partial u}{\partial t} = D\frac{\partial^2 u}{\partial x^2} - \beta(x,t)u + s(x,t) \qquad (1) $$

over the domain $A < x < B$ and $t > 0$. We will assume throughout the section that
$\beta(x,t) \ge 0$ on the entire domain. The objective of this section is to investigate the
numerical solution of equation (1) with non-Dirichlet boundary conditions. The
appropriate finite difference equations for the FTCS method, the BTCS method,
and the Crank-Nicolson scheme will be developed, and the effect of the boundary
conditions upon stability will be determined.

Before turning to the boundary conditions, recall that the generic semidiscrete
equation associated with equation (1) is

$$ \frac{dv_j(t)}{dt} = D\,\frac{v_{j-1}(t) - 2v_j(t) + v_{j+1}(t)}{(\Delta x)^2} - \beta(x_j,t)v_j(t) + s(x_j,t). \qquad (2) $$

In this equation $v_j(t) \approx u(x_j,t)$, $\Delta x = (B - A)/N$ for some positive integer $N$, and
$x_j = A + j\,\Delta x$ for $j = 0, 1, 2, \ldots, N$.
A Model Problem 


For argument’s sake, let’s develop the FTCS method, the BTCS method, and the
Crank-Nicolson scheme for approximating the solution of the initial boundary value
problem

$$ \text{IBVP} \quad \begin{cases} \dfrac{\partial u}{\partial t} = D\dfrac{\partial^2 u}{\partial x^2} - \beta(x,t)u + s(x,t), & A < x < B,\ t > 0 \\ \dfrac{\partial u}{\partial x}(A,t) = \alpha(t) \\ p(t)u(B,t) + q(t)\dfrac{\partial u}{\partial x}(B,t) = r(t) \\ u(x,0) = f(x). \end{cases} $$

Note this problem has a Neumann condition at the left end of the domain and
a Robin condition at the right end. One of each type of non-Dirichlet boundary
condition has been selected so we can investigate any changes to stability criteria
with a single problem.

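We will handle non-Dirichlet boundary conditions here the same way we handled
them in Chapters 8 and 9, by using fictitious nodes. The Neumann boundary
condition at $x = A$ leads to the semidiscrete equation

$$ \frac{dv_0(t)}{dt} = D\,\frac{-2v_0(t) + 2v_1(t) + 2\,\Delta x\,\alpha(t)}{(\Delta x)^2} - \beta(x_0,t)v_0(t) + s(x_0,t), \qquad (3) $$

while the Robin condition at $x = B$ produces

$$ \frac{dv_N(t)}{dt} = D\,\frac{2v_{N-1}(t) - 2\left[1 + \Delta x\,\frac{p(t)}{q(t)}\right]v_N(t) + 2\,\Delta x\,\frac{r(t)}{q(t)}}{(\Delta x)^2} - \beta(x_N,t)v_N(t) + s(x_N,t). \qquad (4) $$

Combining equations (2), (3), and (4) yields the complete semidiscrete approximation
to our model problem. In matrix form, we have

$$ \frac{d\mathbf{v}}{dt} = -\frac{D}{(\Delta x)^2}\left[A'\mathbf{v}(t) + \mathbf{b}(t)\right] - B(t)\mathbf{v}(t) + \mathbf{s}(t), \qquad \mathbf{v}(0) = \mathbf{f}, $$

where

$$ \mathbf{v}(t) = [\, v_0(t) \quad v_1(t) \quad v_2(t) \quad \cdots \quad v_N(t) \,]^T, $$
$$ \mathbf{b}(t) = [\, -2\,\Delta x\,\alpha(t) \quad 0 \quad \cdots \quad 0 \quad -2\,\Delta x\,r(t)/q(t) \,]^T, $$
$$ \mathbf{s}(t) = [\, s(x_0,t) \quad s(x_1,t) \quad s(x_2,t) \quad \cdots \quad s(x_N,t) \,]^T, $$
$$ \mathbf{f} = [\, f(x_0) \quad f(x_1) \quad f(x_2) \quad \cdots \quad f(x_N) \,]^T, $$
$$ B(t) = \operatorname{diag}\left( \beta(x_0,t),\ \beta(x_1,t),\ \beta(x_2,t),\ \ldots,\ \beta(x_N,t) \right), $$

and

$$ A' = \begin{bmatrix} 2 & -2 & & & \\ -1 & 2 & -1 & & \\ & \ddots & \ddots & \ddots & \\ & & -1 & 2 & -1 \\ & & & -2 & 2 + 2\,\Delta x\,p(t)/q(t) \end{bmatrix}. $$

Note that the vectors each have $N + 1$ components, and the matrices are $(N + 1) \times (N + 1)$.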
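From the semidiscrete approximation, following standard procedures, we obtain
the formulation for the FTCS method

$$ \mathbf{w}^{(n+1)} = \left(I - \lambda A' - \Delta t\,B^{(n)}\right)\mathbf{w}^{(n)} - \lambda\mathbf{b}^{(n)} + \Delta t\,\mathbf{s}^{(n)}, $$

the BTCS method

$$ \left(I + \lambda A' + \Delta t\,B^{(n+1)}\right)\mathbf{w}^{(n+1)} = \mathbf{w}^{(n)} - \lambda\mathbf{b}^{(n+1)} + \Delta t\,\mathbf{s}^{(n+1)}, $$

and the Crank-Nicolson scheme

$$ \left(I + \lambda A' + \frac{\Delta t}{2}B^{(n+1)}\right)\mathbf{w}^{(n+1)} = \left(I - \lambda A' - \frac{\Delta t}{2}B^{(n)}\right)\mathbf{w}^{(n)} - \lambda\left(\mathbf{b}^{(n)} + \mathbf{b}^{(n+1)}\right) + \frac{\Delta t}{2}\left(\mathbf{s}^{(n)} + \mathbf{s}^{(n+1)}\right). $$

For the FTCS and BTCS methods, $\lambda = D\,\Delta t/(\Delta x)^2$, while for the Crank-Nicolson
scheme, $\lambda = D\,\Delta t/[2(\Delta x)^2]$.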


Stability Analysis 


We will start our stability analysis with the FTCS method. With a decay term
in the partial differential equation and non-Dirichlet boundary conditions, the
evolution matrix, $E_{FTCS} = I - \lambda A' - \Delta t\,B^{(n)}$, does not match the pattern of the
special matrix we used extensively in Section 10.2. Therefore, we cannot compute
the spectral radius exactly. Since we have different finite difference equations for
different locations on the grid, we also cannot use von Neumann stability analysis.
Fortunately, however, we can obtain an estimate for the spectral radius of $E_{FTCS}$.

First, note that the off-diagonal elements of the tridiagonal evolution matrix 
are all of the same sign. In this case, they are all positive. This implies that 
the eigenvalues of the evolution matrix are all real. See Smith [1] for a proof of 
this result. To actually estimate the locations of the eigenvalues, we will use the 
Gerschgorin Circle Theorem (see the Chapter 4 Overview, page 263). Knowing 
that the eigenvalues are real, the Gerschgorin circles all reduce to intervals along 
the real line. For stability, we will have to guarantee that all of these Gerschgorin 
intervals are contained in the closed interval [—1, 1]. 

For rows $j = 2$ through $j = N$ of $E_{FTCS}$, the diagonal element is
$1 - 2\lambda - \Delta t\,\beta_j^{(n)}$ and the sum of the absolute values of the off-diagonal elements is $2\lambda$. Hence,
the Gerschgorin interval is given by

$$ -2\lambda \le z - \left(1 - 2\lambda - \Delta t\,\beta_j^{(n)}\right) \le 2\lambda, $$

or

$$ 1 - 4\lambda - \Delta t\,\beta_j^{(n)} \le z \le 1 - \Delta t\,\beta_j^{(n)}. $$

Under the assumption that $\beta_j^{(n)} \ge 0$, the right endpoint of this interval is always
less than or equal to 1. Stability then requires that the left endpoint be greater
than or equal to $-1$, or

$$ -1 \le 1 - 4\lambda - \Delta t\,\beta_j^{(n)}. $$

This is precisely the condition we derived in Section 10.3.

Along the first row of the evolution matrix, the diagonal element is
$1 - 2\lambda - \Delta t\,\beta_0^{(n)}$ and the sum of the absolute values of the off-diagonal elements is again $2\lambda$.
The corresponding Gerschgorin interval is then given by

$$ -2\lambda \le z - \left(1 - 2\lambda - \Delta t\,\beta_0^{(n)}\right) \le 2\lambda. $$

This leads to the condition

$$ -1 \le 1 - 4\lambda - \Delta t\,\beta_0^{(n)}, $$

which is identical to the condition derived from the interior grid points. Hence, a
Neumann boundary condition has no effect on the stability of the FTCS method.

What about the Robin condition? Along the last row of the matrix, the
diagonal element is $1 - 2\lambda - \mu - \Delta t\,\beta_N^{(n)}$, where $\mu = 2\lambda\,\Delta x\,p^{(n)}/q^{(n)}$, and the sum of
the absolute values of the off-diagonal elements is once again $2\lambda$. The Gerschgorin
interval obtained from this row is then

$$ -2\lambda \le z - \left(1 - 2\lambda - \mu - \Delta t\,\beta_N^{(n)}\right) \le 2\lambda. $$

This leads to the stability condition

$$ -1 \le 1 - 4\lambda - \mu - \Delta t\,\beta_N^{(n)}, $$

which is slightly stronger than the previous conditions. A Robin boundary condition
therefore tightens the bound for stability for the FTCS method.

Putting all of our results together, we find that the most restrictive condition
is the one obtained from the Robin condition. Recalling that $\lambda = D\,\Delta t/(\Delta x)^2$, the
time step for the FTCS method must satisfy

$$ \Delta t \le \frac{(\Delta x)^2}{2D + D\,\Delta x\,\max_{n \ge 0}\left(p^{(n)}/q^{(n)}\right) + (\Delta x)^2 \left(\max_{0 \le j \le N,\ n \ge 0} \beta_j^{(n)}\right)/2}. $$

Since we have only estimated the spectral radius, it is possible that this bound
is overly restrictive. The accuracy of the estimate will be tested in the second
example.
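The Gerschgorin argument can be checked numerically with a short sketch. The function names below are hypothetical, and the sample data ($D = 1/16$, $\Delta x = 1/20$, $\beta(x) = 4x(1 - x)$, $p/q = 2$) are assumed values for a problem with a Robin condition at the right end; each row of $E_{FTCS}$ is taken to have off-diagonal absolute sum $2\lambda$, as in the analysis above.

```python
def ftcs_dt_bound(D, dx, beta_max, p_over_q_max=0.0):
    """Time step bound for FTCS; p_over_q_max = 0 recovers the Neumann case."""
    return dx**2 / (2.0 * D + D * dx * p_over_q_max + dx**2 * beta_max / 2.0)

def gerschgorin_intervals(D, dx, dt, beta, p_over_q):
    """Gerschgorin intervals for the rows of E = I - lam*A' - dt*B."""
    lam = D * dt / dx**2
    mu = 2.0 * lam * dx * p_over_q               # extra diagonal term, Robin row
    last = len(beta) - 1
    intervals = []
    for j, beta_j in enumerate(beta):
        center = 1.0 - 2.0 * lam - dt * beta_j
        if j == last:
            center -= mu
        radius = 2.0 * lam                       # off-diagonal absolute row sum
        intervals.append((center - radius, center + radius))
    return intervals

def all_inside_unit_interval(intervals, tol=1e-12):
    return all(lo >= -1.0 - tol and hi <= 1.0 + tol for lo, hi in intervals)

D, N = 1.0 / 16.0, 20                            # sample values (assumed)
dx = 1.0 / N
beta = [4.0 * (j * dx) * (1.0 - j * dx) for j in range(N + 1)]
dt_neumann = ftcs_dt_bound(D, dx, max(beta))
dt_robin = ftcs_dt_bound(D, dx, max(beta), p_over_q_max=2.0)
ok_at_bound = all_inside_unit_interval(gerschgorin_intervals(D, dx, dt_robin, beta, 2.0))
ok_beyond = all_inside_unit_interval(gerschgorin_intervals(D, dx, dt_neumann, beta, 2.0))
print(dt_neumann, dt_robin, ok_at_bound, ok_beyond)
```

For these sample values the Neumann-type bound is $2/101$ and the Robin bound is $1/53$; at the Robin bound every interval lies inside $[-1, 1]$, while at the larger step the interval from the last row extends below $-1$. Since Gerschgorin intervals only enclose the eigenvalues, a violated interval signals possible, not certain, instability.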

When it comes to the BTCS method and the Crank-Nicolson scheme, neither
the Neumann boundary condition nor the Robin boundary condition affects stability.
Both methods remain unconditionally stable. The details of each analysis are
left as an exercise.


EXAMPLE 10.9 A Test Problem with a Neumann Boundary Condition 


Let’s consider the initial boundary value problem

$$ \frac{\partial u}{\partial t} = \frac{1}{16}\frac{\partial^2 u}{\partial x^2} - 4x(1 - x)u + e^{-t}(8x^2 - t) $$

$$ \frac{\partial u}{\partial x}(0,t) = 0, \quad \frac{\partial u}{\partial x}(1,t) = 8te^{-t} \qquad (5) $$

$$ u(x,0) = 2\sin(2\pi x). $$

The source term for this problem is $s(x,t) = e^{-t}(8x^2 - t)$, and the decay coefficient
is $\beta(x,t) = 4x(1 - x)$. The boundary conditions match the pattern of our model
problem with

$$ \alpha(t) = 0 $$

and

$$ p(t) = 0, \quad q(t) = 1, \quad \text{and} \quad r(t) = 8te^{-t}. $$

Figure 10.11 Approximate solutions to the initial boundary value
problem given in equation (5) computed using the FTCS method, the
BTCS method, and the Crank-Nicolson scheme. All solutions are displayed
at $t = 6$.


Figure 10.11 displays the approximate solution to this problem computed
using our three finite difference methods. For each method, $\Delta x = 1/20$ was used.
The upper left panel contains two approximate solutions computed using the FTCS
method. With $\Delta x = 1/20$, stability requires

$$ \Delta t \le \frac{(1/20)^2}{2(1/16) + (1/20)^2(1)/2} = \frac{2}{101} \approx 0.0198. $$

The solid curve corresponds to $\Delta t = 2/101$, while the dashed curve was computed
with $\Delta t = 1/50$. The instability in the latter solution is clear. The approximate
solutions computed using the BTCS method and the Crank-Nicolson scheme are
shown in the upper right and lower left panels, respectively. Each of these curves
was obtained with $\Delta t = 1/5$, and neither displays any signs of instability.
EXAMPLE 10.10 A Test Problem with a Robin Boundary Condition


Let’s consider the initial boundary value problem

$$ \frac{\partial u}{\partial t} = \frac{1}{16}\frac{\partial^2 u}{\partial x^2} - 4x(1 - x)u + e^{-t}(8x^2 - t) $$

$$ \frac{\partial u}{\partial x}(0,t) = 0, \quad 2u(1,t) + \frac{\partial u}{\partial x}(1,t) = 8te^{-t} \qquad (6) $$

$$ u(x,0) = 2\sin(2\pi x). $$

The source term for this problem is $s(x,t) = e^{-t}(8x^2 - t)$, and the decay coefficient
is $\beta(x,t) = 4x(1 - x)$. The boundary conditions match the pattern of our model
problem with

$$ \alpha(t) = 0 $$

and

$$ p(t) = 2, \quad q(t) = 1, \quad \text{and} \quad r(t) = 8te^{-t}. $$

Figure 10.12 displays the approximate solution to this problem computed
using our three finite difference methods. For each method, $\Delta x = 1/20$ was used.
The upper left panel contains two approximate solutions computed using the FTCS
method. With $\Delta x = 1/20$, stability requires

$$ \Delta t \le \frac{(1/20)^2}{2(1/16) + (1/16)(1/20)(2) + (1/20)^2(1)/2} = \frac{1}{53} \approx 0.01887. $$

The solid curve corresponds to $\Delta t = 1/53$, while the dashed curve was computed
with $\Delta t = 2/101$. The instability in the latter solution is clear, suggesting that
our estimate for the spectral radius was quite accurate. The approximate solutions
computed using the BTCS method and the Crank-Nicolson scheme are shown in the
upper right and lower left panels, respectively. Each of these curves was obtained
with $\Delta t = 1/5$, and neither displays any signs of instability.


Figure 10.12 Approximate solutions to the initial boundary value
problem given in equation (6) computed using the FTCS method, the
BTCS method, and the Crank-Nicolson scheme. All solutions are displayed
at $t = 6$.


An Application Problem: Time-Dependent Temperature in a Fin


In the Overview to this chapter (see page 803), we showed that the time-dependent
temperature, $T(x,t)$, within a fin of circular cross section and constant
cross-sectional area satisfies the initial boundary value problem

$$ \frac{\partial T}{\partial t} = \frac{k}{\rho c_p}\frac{\partial^2 T}{\partial x^2} - \frac{2h}{\rho c_p r}\left(T - T_\infty\right), $$

$$ T(0,t) = f(t), \quad T(x,0) = T_0, \quad -k\left.\frac{\partial T}{\partial x}\right|_{x=L} = h\left(T(L,t) - T_\infty\right). $$

Here, $r$ is the radius and $L$ the length of the fin; $\rho$ is the mass density, $c_p$ the
specific heat, and $k$ the thermal conductivity of the material from which the fin is
constructed; $h$ is the convection heat transfer coefficient between the fin and the
surrounding air; and $T_\infty$ is the temperature of the air. The function $f$ gives the
time-varying temperature of the object to which the fin is attached.

Suppose the fin has a radius $r = 1.5$ cm, a length $L = 10$ cm, and is made
from aluminum. For aluminum, $\rho = 2702$ kg/m$^3$, $c_p = 903$ J/kg·K, and $k = 237$
W/m·K. With $h = 20$ W/m$^2$·K, $T_\infty = 20$°C and


f(t) = 204 80 (1 _ ew) 


we obtain the temperature profiles shown in Figure 10.13. All calculations were
performed using the Crank-Nicolson scheme with $\Delta x = 0.005$ meters and $\Delta t = 0.2$
seconds. Note that the temperature within the fin has essentially reached steady
state by $t = 300$ seconds.

Recall that a fin is designed to increase the heat dissipation rate from a larger 
object. Suppose the fin with which we’ve been working is attached to a square 
surface with side length s = 0.25 meters. Without the fin, the heat dissipation rate 
from this surface would be 


s*h(f(t) ~ Too) = 100 (1- e¥/?°) 
The heat dissipation rate with the fin is

$$ (s^2 - \pi r^2)\,h\left(f(t) - T_\infty\right) - \pi r^2 k\,\frac{\partial T}{\partial x}(0,t), $$

where the second term accounts for the heat dissipated by the fin itself. Figure 10.14
displays the two dissipation rates as functions of time. To obtain this figure, the
second-order forward difference approximation for the first derivative was used to
calculate $\frac{\partial T}{\partial x}(0,t)$. For $t < 50$ seconds, the heat dissipation rate with the fin is at
least two to three times larger than the dissipation rate without the fin. At steady
state, even with just one fin, the heat dissipation rate has been increased by nearly
14.0%.

Figure 10.13 Time-dependent temperature distribution along the length
of a fin with uniform cross-sectional area and experiencing convective
heat loss from its lateral surface.


Figure 10.14 Heat dissipation rate with and without a single fin.


References

1. G. D. Smith, Numerical Solution of Partial Differential Equations: Finite
Difference Methods, 3rd edition, Oxford Applied Mathematics and Computing Science
Series, Oxford University Press, Oxford, 1985.


EXERCISES


In Exercises 1-3, numerically verify that:

(a) the FTCS method is stable for $\lambda = 1/2$ but unstable for any $\lambda > 1/2$;
(b) the BTCS method is unconditionally stable; and
(c) the Crank-Nicolson method is unconditionally stable.


du du 1, 
1. Be = gq tet el 2), u(z,0) =1—a— = sin(2mz) 
Ou _ ~4n7t Ou i 407+ 
5m (Orb) =1+2¢ t, anh = -l]-2e t 
duu Ou or bu = : 
2. Bt Ax? +2, By (0) Se at; an *) =-e “+t, ufx,0) =smnz 
Ou a Ou 2\ —2xt Ou, _ —ty Ou —t —Tt 
3. et age (ett Je“, By Ort) -e +t, nt *) -e "-te ™, 
u(z,0) =1+sing 
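In Exercises 4-6, numerically verify that
(a) the FTCS method is unstable for

$$ \Delta t = \frac{(\Delta x)^2}{2D + (\Delta x)^2 \left(\max_{0 \le j \le N,\ n \ge 0} \beta_j^{(n)}\right)/2} $$

but stable for

$$ \Delta t \le \frac{(\Delta x)^2}{2D + D\,\Delta x\,\max_{n \ge 0}\left(p^{(n)}/q^{(n)}\right) + (\Delta x)^2 \left(\max_{0 \le j \le N,\ n \ge 0} \beta_j^{(n)}\right)/2}; $$

(b) the BTCS method is unconditionally stable; and
(c) the Crank-Nicolson method is unconditionally stable.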
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du Ou Ou du 1 
. OL = ope an _ 1, 2u(l,t)+ 2 (1, t) = 0, u(z, 0) = 1-—~ sin(2re) 
Ou Au 2 Ou Ou : 
Ot Ba? Tat an BMD t a) =0, ulz,0) = sine 
du ug? 2, -at Ou 1 Ou nt 
one haar u-(ztt*)e ’ aa uC ets a Amt) =e ? 


u(z,0) =1+sinz 


7. A 15-cm-long piece of copper wire is initially at a uniform temperature of 20°C.
At $t = 0$, a 10 amp current begins flowing through the wire, generating heat. The
temperature within the wire satisfies the initial boundary value problem

$$ \rho c_p\frac{\partial T}{\partial t} = k\frac{\partial^2 T}{\partial x^2} + \frac{4I^2R}{\pi D^2}, $$

$$ k\frac{\partial T}{\partial x}(0,t) = h\left[T(0,t) - T_\infty\right], \quad -k\frac{\partial T}{\partial x}(0.15,t) = h\left[T(0.15,t) - T_\infty\right], $$

$$ T(x,0) = 20. $$

The parameters in this problem are the mass density $\rho = 8933$ kg/m$^3$, the heat
capacity $c_p = 385$ J/kg·°C, the thermal conductivity $k = 401$ J/m·s·°C, the
convective heat transfer coefficient $h = 50$ J/m$^2$·s·°C, the diameter of the wire
$D = 2.6$ mm, the current $I$, and the resistance per unit length

$$ R = \frac{6.8 \times 10^{-8}\ \Omega\cdot\text{m}}{\pi D^2}. $$

Approximate the temperature profile along the length of the wire 30 seconds, 60
seconds, 90 seconds, and 120 seconds after the current has begun to flow.


8. Two chambers are connected by a hollow tube 0.4 meters long and contain pools
of ethyl alcohol that are maintained at 30°C. At $t = 0$, the valves at each end
of the tube are opened, and alcohol vapors diffuse into the tube. At 30°C, the
diffusion coefficient of the alcohol vapors is $D = 1.19 \times 10^{-5}$ m$^2$/s and 10%
alcohol vapor is present in the air. The tube contains a filter with a mean
absorption rate of $\mu = 0.0069$ s$^{-1}$. The percent of alcohol vapor within the
tube, $u(x,t)$, satisfies

$$ \frac{\partial u}{\partial t} = D\frac{\partial^2 u}{\partial x^2} - \mu u, \quad u(0,t) = 10, \quad \frac{\partial u}{\partial x}(0.2,t) = 0, \quad u(x,0) = 0. $$

Approximate $u(x,t)$ at $t = 100$ seconds, $t = 200$ seconds, and $t = 300$ seconds.


9. A circular commercial bronze rod of radius $r = 5$ cm and length 1 meter initially
has a uniform temperature distribution 50 K above ambient temperature. At $t = 0$
the temperature at the left end of the rod is lowered to ambient. Let $\theta$ denote
the difference between the temperature of the rod and ambient temperature. If
the surface of the rod is exposed to convective heat transfer, then $\theta$ satisfies the
initial boundary value problem

$$ \rho c_p\frac{\partial\theta}{\partial t} = k\frac{\partial^2\theta}{\partial x^2} - \frac{2h}{r}\theta, \quad \theta(0,t) = 0, \quad h\theta(1,t) + k\frac{\partial\theta}{\partial x}(1,t) = 0, \quad \theta(x,0) = 50, $$

where $\rho = 8800$ kg/m$^3$ is the mass density, $c_p = 420$ J/kg·K is the heat capacity,
and $k = 52$ W/m·K is the thermal conductivity of the rod. $h = 25$ W/m$^2$·K is
the convective heat transfer coefficient. Determine the temperature profile along
the rod after 600 seconds, 1200 seconds, and 1800 seconds.

10. Soil consolidation is the hydrodynamic process by which water is expelled from
saturated soil voids when the soil is compacted. If the layer underneath the
compressible soil is impermeable, then, as the water is expelled, the pore water
pressure, $\phi(z,t)$, satisfies

$$ \frac{\partial\phi}{\partial t} = C_v\frac{\partial^2\phi}{\partial z^2}, \quad \phi(z,0) = \Delta\phi, \quad \phi(0,t) = 0, \quad \frac{\partial\phi}{\partial z}(H_d,t) = 0, $$

where $C_v$ is the coefficient of consolidation, $H_d$ is the maximum drainage path
(which, for this problem, is equal to the thickness of the soil layer) and $\Delta\phi$ is the
change in pressure due to the compacting force. Introducing the nondimensional
variables

$$ \Phi = \phi/\Delta\phi, \quad Z = z/H_d, \quad \text{and} \quad T = tC_v/H_d^2, $$

the problem becomes

$$ \frac{\partial\Phi}{\partial T} = \frac{\partial^2\Phi}{\partial Z^2}, \quad \Phi(Z,0) = 1, \quad \Phi(0,T) = 0, \quad \frac{\partial\Phi}{\partial Z}(1,T) = 0. $$

Plot $\Phi(Z,T)$ for $T = 0.05$, $T = 0.1$, $T = 0.25$, and $T = 1$.

11. Rework the “Time-Dependent Temperature in a Fin” application problem for a
copper fin with $\rho = 8933$ kg/m$^3$, $c_p = 385$ J/kg·K, and $k = 401$ W/m·K.

12. Rework the “Time-Dependent Temperature in a Fin” application problem for a
stainless steel fin with $\rho = 8055$ kg/m$^3$, $c_p = 480$ J/kg·K, and $k = 15.1$ W/m·K.

13. (a) Suppose a fin has a square cross section with side length $s$. Derive the partial
differential equation for the temperature $T(x,t)$ within the fin.

(b) Let $s = 2.65$ cm. Using the parameter values, initial condition and boundary
conditions from the “Time-Dependent Temperature in a Fin” application
problem, calculate $T(x,t)$ for $t = 20$, $t = 60$, $t = 180$, and $t = 360$ seconds.
How does the heat dissipation rate from the square fin compare with that
of the circular fin?

14. Derive equations (3) and (4).

15. Use the Gerschgorin Circle Theorem to show that the BTCS method,

$$ \left(I + \lambda A' + \Delta t\,B^{(n+1)}\right)\mathbf{w}^{(n+1)} = \mathbf{w}^{(n)} - \lambda\mathbf{b}^{(n+1)} + \Delta t\,\mathbf{s}^{(n+1)}, $$

and the Crank-Nicolson scheme,

$$ \left(I + \lambda A' + \frac{\Delta t}{2}B^{(n+1)}\right)\mathbf{w}^{(n+1)} = \left(I - \lambda A' - \frac{\Delta t}{2}B^{(n)}\right)\mathbf{w}^{(n)} - \lambda\left(\mathbf{b}^{(n)} + \mathbf{b}^{(n+1)}\right) + \frac{\Delta t}{2}\left(\mathbf{s}^{(n)} + \mathbf{s}^{(n+1)}\right), $$

are unconditionally stable.
10.5 POLAR COORDINATES


In this section we will develop finite difference methods for the following form of 
the heat equation in polar coordinates: 


$$\frac{\partial u}{\partial t} = \frac{D}{r}\frac{\partial}{\partial r}\left(r\frac{\partial u}{\partial r}\right) - \beta(r,t)u + s(r,t).$$


As in previous sections, $-\beta(r,t)u$ is a decay term, and $s(r,t)$ is a source term.
Equations of this type can arise from problems in cylindrical coordinates with axial
symmetry and in which radial, rather than longitudinal, flow is important.

Starting Off Basic


Consider the initial boundary value problem 


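$$\frac{\partial u}{\partial t} = D\left(\frac{\partial^2 u}{\partial r^2} + \frac{1}{r}\frac{\partial u}{\partial r}\right), \quad 0 < r < R,\ t > 0$$

$$\frac{\partial u}{\partial r}(0,t) = 0$$

$$u(R,t) = u_R(t)$$

$$u(r,0) = f(r).$$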


This problem has neither a decay term nor a source term. We will add these later. 
The Neumann boundary condition at r = 0 is a consequence of the symmetry of 
the problem and allows us to handle the artificial singularity in the differential 
equation. 

For the spatial discretization of the above problem, let $\Delta r = R/N$ for some
positive integer $N$, and then define $r_j = j\Delta r$ for $j = 0, 1, 2, \ldots, N$. Evaluate
the partial differential equation at an arbitrary interior grid point $r = r_j$
($j = 1, 2, 3, \ldots, N-1$), and use second-order central difference formulas to approximate
the space derivatives. Dropping the truncation error terms yields the semidiscrete
equation

$$\frac{dv_j(t)}{dt} = D\left[\frac{v_{j-1}(t) - 2v_j(t) + v_{j+1}(t)}{(\Delta r)^2} + \frac{1}{r_j}\,\frac{v_{j+1}(t) - v_{j-1}(t)}{2\Delta r}\right], \tag{1}$$

where $v_j(t) \approx u(r_j,t)$. Taking into account the definition $r_j = j\Delta r$, equation (1)
becomes

$$\frac{dv_j(t)}{dt} = D\left[\frac{v_{j-1}(t) - 2v_j(t) + v_{j+1}(t)}{(\Delta r)^2} + \frac{1}{j}\,\frac{v_{j+1}(t) - v_{j-1}(t)}{2(\Delta r)^2}\right]. \tag{2}$$


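Before evaluating the partial differential equation at $r_0 = 0$, we have to handle the
artificial singularity. Because $\partial u/\partial r$ vanishes at $r = 0$, L'Hôpital's rule gives
$\lim_{r \to 0} \frac{1}{r}\frac{\partial u}{\partial r} = \frac{\partial^2 u}{\partial r^2}(0,t)$. Taking the limit as $r \to 0$, we therefore find that the
appropriate partial differential equation is

$$\frac{\partial u}{\partial t}(0,t) = 2D\frac{\partial^2 u}{\partial r^2}(0,t).$$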
We once again use a second-order central difference formula to approximate the
space derivative. The fictitious node in the resulting equation is eliminated by
applying the Neumann boundary condition at $r = 0$. The final semidiscrete equation
is then

$$\frac{dv_0(t)}{dt} = 4D\,\frac{v_1(t) - v_0(t)}{(\Delta r)^2}. \tag{3}$$

For the temporal discretization, we will start with the FTCS method. Recall
that this method arises by evaluating the semidiscrete equations at time level $t = t_n$
and then replacing the time derivative with its first-order forward difference formula.
Applying this procedure to equation (2) produces the finite difference equation

$$w_j^{(n+1)} = \lambda\left(1 - \frac{1}{2j}\right)w_{j-1}^{(n)} + (1 - 2\lambda)w_j^{(n)} + \lambda\left(1 + \frac{1}{2j}\right)w_{j+1}^{(n)}, \tag{4}$$

for $j = 1, 2, 3, \ldots, N-1$, where $\lambda = D\Delta t/(\Delta r)^2$. The finite difference equation
obtained from equation (3) is

$$w_0^{(n+1)} = (1 - 4\lambda)w_0^{(n)} + 4\lambda w_1^{(n)}. \tag{5}$$


Before turning to the development of the other finite difference methods, let’s
investigate the stability of the FTCS method. From equations (4) and (5), we find
that the evolution matrix, $E_{FTCS}$, takes the form

$$E_{FTCS} = \begin{bmatrix}
1-4\lambda & 4\lambda & & & & \\
\frac{\lambda}{2} & 1-2\lambda & \frac{3\lambda}{2} & & & \\
 & \frac{3\lambda}{4} & 1-2\lambda & \frac{5\lambda}{4} & & \\
 & & \ddots & \ddots & \ddots & \\
 & & & \frac{(2N-5)\lambda}{2N-4} & 1-2\lambda & \frac{(2N-3)\lambda}{2N-4} \\
 & & & & \frac{(2N-3)\lambda}{2N-2} & 1-2\lambda
\end{bmatrix}.$$

Since this is a tridiagonal matrix with all positive off-diagonal elements, the
eigenvalues are guaranteed to be real. To estimate the locations of these eigenvalues,
we consider the Gerschgorin intervals associated with each row. For the FTCS
method to be stable, we must have all of these intervals contained in the closed
interval $[-1, 1]$.

The first row of the evolution matrix produces the interval

$$|z - (1 - 4\lambda)| \le 4\lambda,$$

or, equivalently, $1 - 8\lambda \le z \le 1$. To guarantee this interval is contained in the
closed interval $[-1, 1]$ requires $\lambda \le 1/4$. Rows 2 through $N - 1$ produce the same
Gerschgorin interval:

$$|z - (1 - 2\lambda)| \le 2\lambda,$$

or $1 - 4\lambda \le z \le 1$. This interval will be a subset of $[-1, 1]$ provided $\lambda \le 1/2$. From
the last row of the matrix, we obtain the Gerschgorin interval

$$|z - (1 - 2\lambda)| \le \frac{(2N-3)\lambda}{2N-2},$$

which leads to the condition

$$\lambda \le \frac{4N-4}{6N-7} \approx \frac{2}{3}.$$

The most restrictive condition on $\lambda$ therefore comes from the first row. In
conclusion, the FTCS method is once again conditionally stable, requiring $\lambda \le 1/4$,
or

$$\Delta t \le \frac{(\Delta r)^2}{4D}.$$

Values of $\Delta t$ that are less than or equal to $(\Delta r)^2/(4D)$ will definitely be
sufficient to guarantee the stability of the FTCS method. However, since this bound
was obtained using estimates for the eigenvalues of $E_{FTCS}$ based on the Gerschgorin
Circle Theorem, it is possible that the bound may be overly restrictive. We will
examine this issue further in the examples and exercises.
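The Gerschgorin bound can be compared with the true spectral radius numerically. The following is a minimal sketch, assuming NumPy is available; the function names `ftcs_evolution_matrix` and `spectral_radius` are illustrative and do not come from the text. It assembles the evolution matrix from equations (4) and (5) and reports its spectral radius for several values of λ.

```python
import numpy as np

def ftcs_evolution_matrix(lam, N):
    """Evolution matrix for FTCS on the disk; unknowns are w_0, ..., w_{N-1}."""
    E = np.zeros((N, N))
    E[0, 0] = 1.0 - 4.0 * lam            # equation (5), row for r = 0
    E[0, 1] = 4.0 * lam
    for j in range(1, N):                # equation (4), interior rows
        E[j, j - 1] = lam * (1.0 - 1.0 / (2 * j))
        E[j, j] = 1.0 - 2.0 * lam
        if j + 1 < N:                    # w_N comes from the Dirichlet condition
            E[j, j + 1] = lam * (1.0 + 1.0 / (2 * j))
    return E

def spectral_radius(M):
    """Largest eigenvalue magnitude of M."""
    return float(np.max(np.abs(np.linalg.eigvals(M))))

if __name__ == "__main__":
    N = 20
    for lam in (0.25, 1.0 / 3.0, 0.40, 0.45, 0.50):
        rho = spectral_radius(ftcs_evolution_matrix(lam, N))
        print(f"lambda = {lam:.4f}  spectral radius = {rho:.6f}")
```

A spectral radius no larger than one for some λ > 1/4 would indicate that the Gerschgorin estimate is sufficient but not necessary for that grid.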

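Next, we will develop the BTCS method. Recall that this method arises by
evaluating the semidiscrete equations at time level $t = t_{n+1}$ and then replacing
the time derivative with its first-order backward difference formula. Applying this
procedure to equations (2) and (3) yields the finite difference equations

$$-\lambda\left(1 - \frac{1}{2j}\right)w_{j-1}^{(n+1)} + (1 + 2\lambda)w_j^{(n+1)} - \lambda\left(1 + \frac{1}{2j}\right)w_{j+1}^{(n+1)} = w_j^{(n)}$$

for $j = 1, 2, 3, \ldots, N-1$, and

$$(1 + 4\lambda)w_0^{(n+1)} - 4\lambda w_1^{(n+1)} = w_0^{(n)}.$$

As for the FTCS method, $\lambda = D\Delta t/(\Delta r)^2$.
Each time step of the BTCS method requires the solution of a linear system
of equations with the tridiagonal coefficient matrix

$$\begin{bmatrix}
1+4\lambda & -4\lambda & & & & \\
-\frac{\lambda}{2} & 1+2\lambda & -\frac{3\lambda}{2} & & & \\
 & -\frac{3\lambda}{4} & 1+2\lambda & -\frac{5\lambda}{4} & & \\
 & & \ddots & \ddots & \ddots & \\
 & & & -\frac{(2N-5)\lambda}{2N-4} & 1+2\lambda & -\frac{(2N-3)\lambda}{2N-4} \\
 & & & & -\frac{(2N-3)\lambda}{2N-2} & 1+2\lambda
\end{bmatrix}.$$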

The inverse of this matrix is the evolution matrix $E_{BTCS}$. It is easy to show, using
the Gerschgorin Circle Theorem, that the eigenvalues of the above matrix are all
greater than or equal to 1 in magnitude. Hence the eigenvalues of the evolution
matrix all lie between 0 and 1, and the BTCS method is unconditionally stable.

Finally, let’s develop the Crank-Nicolson scheme. Integrating both sides of
equations (2) and (3) from $t = t_n$ to $t = t_{n+1}$ and using the trapezoidal rule to
approximate the integral on the right-hand side yields

$$-\lambda\left(1 - \frac{1}{2j}\right)w_{j-1}^{(n+1)} + (1 + 2\lambda)w_j^{(n+1)} - \lambda\left(1 + \frac{1}{2j}\right)w_{j+1}^{(n+1)} = \lambda\left(1 - \frac{1}{2j}\right)w_{j-1}^{(n)} + (1 - 2\lambda)w_j^{(n)} + \lambda\left(1 + \frac{1}{2j}\right)w_{j+1}^{(n)}$$

for $j = 1, 2, 3, \ldots, N-1$, and

$$(1 + 4\lambda)w_0^{(n+1)} - 4\lambda w_1^{(n+1)} = (1 - 4\lambda)w_0^{(n)} + 4\lambda w_1^{(n)}.$$

Here, $\lambda = D\Delta t/[2(\Delta r)^2]$. Like the BTCS method, each time step of the
Crank-Nicolson scheme requires the solution of a linear system with a tridiagonal
coefficient matrix.


To investigate the stability of the Crank-Nicolson scheme, note that as in
Section 10.2, we can express the evolution matrix as

$$E_{CN} = E_{BTCS}E_{FTCS} = -I + 2E_{BTCS}.$$

Having just established that the eigenvalues of $E_{BTCS}$ all lie between 0 and 1, it then
follows that the eigenvalues of $E_{CN}$ all lie between $-1$ and $1$. Hence, the Crank-Nicolson
scheme is unconditionally stable.
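The identity relating the three evolution matrices can be checked numerically. The sketch below assumes NumPy; the helper names `polar_stencil_matrix` and `evolution_matrices` are hypothetical. It writes the FTCS matrix as $I - \lambda A$ and the BTCS matrix as $(I + \lambda A)^{-1}$, where $A$ collects the stencil coefficients of the disk problem and λ is the Crank-Nicolson value $D\Delta t/[2(\Delta r)^2]$.

```python
import numpy as np

def polar_stencil_matrix(N):
    """Matrix A such that E_FTCS = I - lam*A and E_BTCS = inv(I + lam*A)."""
    A = np.zeros((N, N))
    A[0, 0], A[0, 1] = 4.0, -4.0         # row for r = 0
    for j in range(1, N):                # interior rows
        A[j, j - 1] = -(1.0 - 1.0 / (2 * j))
        A[j, j] = 2.0
        if j + 1 < N:                    # w_N is boundary data, not an unknown
            A[j, j + 1] = -(1.0 + 1.0 / (2 * j))
    return A

def evolution_matrices(lam, N):
    """Return (E_FTCS, E_BTCS, E_CN) for the same value of lam."""
    I = np.eye(N)
    A = polar_stencil_matrix(N)
    E_ftcs = I - lam * A
    E_btcs = np.linalg.inv(I + lam * A)  # dense inverse, for illustration only
    return E_ftcs, E_btcs, E_btcs @ E_ftcs

if __name__ == "__main__":
    E_ftcs, E_btcs, E_cn = evolution_matrices(5.0, 20)
    print(np.allclose(E_cn, -np.eye(20) + 2.0 * E_btcs))
    print(np.max(np.abs(np.linalg.eigvals(E_cn))))
```

The explicit inverse is used only to expose the matrices; a time-stepping code would solve the tridiagonal system instead.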


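EXAMPLE 10.11 Demonstration of Stability


Let’s approximate the solution of the initial boundary value problem

$$\frac{\partial u}{\partial t} = \frac{1}{10}\left(\frac{\partial^2 u}{\partial r^2} + \frac{1}{r}\frac{\partial u}{\partial r}\right), \quad 0 < r < 1,\ t > 0$$

$$\frac{\partial u}{\partial r}(0,t) = 0$$

$$u(1,t) = 0$$

$$u(r,0) = J_0(kr).$$

$J_0$ is the zeroth-order Bessel function of the first kind, and $k \approx 2.404825558$ is
the smallest positive root of $J_0$. The exact solution of this problem is
$u(r,t) = e^{-(k^2/10)t}J_0(kr)$.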
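Figure 10.15 Exact solution to the initial boundary value problem

$$\frac{\partial u}{\partial t} = \frac{1}{10}\left(\frac{\partial^2 u}{\partial r^2} + \frac{1}{r}\frac{\partial u}{\partial r}\right), \quad \frac{\partial u}{\partial r}(0,t) = 0, \quad u(1,t) = 0, \quad u(r,0) = J_0(kr),$$

and approximate solutions computed using the FTCS method, the BTCS
method, and the Crank-Nicolson scheme. All solutions are displayed at
$t = 10$. Panel titles: Exact solution; FTCS method, $\Delta t = 1/160$.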


Figure 10.15 displays the exact solution and the approximate solutions computed
using the FTCS method, the BTCS method, and the Crank-Nicolson scheme
in the upper left, upper right, lower left, and lower right panels, respectively. All
solutions are plotted at $t = 10$, and each numerical method used $\Delta r = 1/20$. The
time step for the FTCS method was selected as the largest value allowed by our
stability analysis:

$$\Delta t = \frac{(\Delta r)^2}{4D} = \frac{(1/20)^2}{4(1/10)} = \frac{1}{160}.$$

$\Delta t = 1/32$ was used for both the BTCS method and the Crank-Nicolson scheme.
There are no signs of instability in any of the numerical approximations.
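The computation in this example can be imitated with a few lines of code. The sketch below is an illustration under stated assumptions, not the program used to produce the figure: it assumes NumPy, evaluates $J_0$ from its integral representation rather than a library routine, and advances the solution with dense matrices. The names `bessel_j0`, `stencil_matrix`, and `simulate` are hypothetical.

```python
import numpy as np

K = 2.404825558  # smallest positive root of J0, as quoted in the example

def bessel_j0(x, M=200):
    """J0(x) = (1/pi) * integral over [0, pi] of cos(x sin(theta)), evaluated
    with the periodic trapezoidal rule (the mean of M equally spaced samples)."""
    theta = np.pi * np.arange(M) / M
    x = np.atleast_1d(np.asarray(x, dtype=float))
    return np.cos(np.outer(x, np.sin(theta))).mean(axis=1)

def stencil_matrix(N):
    """Stencil coefficients for the disk; unknowns are w_0, ..., w_{N-1}."""
    A = np.zeros((N, N))
    A[0, 0], A[0, 1] = 4.0, -4.0
    for j in range(1, N):
        A[j, j - 1] = -(1.0 - 1.0 / (2 * j))
        A[j, j] = 2.0
        if j + 1 < N:
            A[j, j + 1] = -(1.0 + 1.0 / (2 * j))
    return A

def simulate(method, N, dt, t_end, D=0.1):
    """Advance u(r, 0) = J0(K r) to t_end; u(1, t) = 0 needs no boundary vector."""
    dr = 1.0 / N
    r = dr * np.arange(N)
    lam = D * dt / dr**2
    A, I = stencil_matrix(N), np.eye(N)
    if method == "ftcs":
        E = I - lam * A
    elif method == "btcs":
        E = np.linalg.inv(I + lam * A)   # dense inverse, for illustration only
    else:
        raise ValueError(f"unknown method: {method}")
    w = bessel_j0(K * r)
    for _ in range(int(round(t_end / dt))):
        w = E @ w
    return r, w

if __name__ == "__main__":
    for method, dt in (("ftcs", 1.0 / 160.0), ("btcs", 1.0 / 32.0)):
        r, w = simulate(method, 20, dt, 10.0)
        exact = np.exp(-(K**2 / 10.0) * 10.0) * bessel_j0(K * r)
        print(method, "max abs error:", np.max(np.abs(w - exact)))
```

Comparing the printed errors for the two methods gives a rough sense of how the first-order time discretization behaves with the two time steps used in the example.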
Figure 10.16 Approximate solutions to the initial boundary value
problem

$$\frac{\partial u}{\partial t} = \frac{1}{10}\left(\frac{\partial^2 u}{\partial r^2} + \frac{1}{r}\frac{\partial u}{\partial r}\right), \quad \frac{\partial u}{\partial r}(0,t) = 0, \quad u(1,t) = 0, \quad u(r,0) = J_0(kr),$$

computed using the FTCS method with different time steps. All solutions
are displayed at $t = 10$.


Since our stability analysis was based not on analytical expressions for the
eigenvalues of $E_{FTCS}$, but rather on estimates obtained from the Gerschgorin Circle
Theorem, we should investigate the accuracy of the stability condition, $\lambda \le 1/4$,
which was obtained for the FTCS method. Figure 10.16 displays approximate solutions
obtained from the FTCS method with $\Delta t = 1/140$, $\Delta t = 1/120$, $\Delta t = 1/100$,
and $\Delta t = 1/80$. These time steps correspond to $\lambda = 2/7$, $\lambda = 1/3$, $\lambda = 2/5$, and
$\lambda = 1/2$, respectively. The solutions with the first three time steps appear to be
stable, while the solution computed with $\Delta t = 1/80$ is clearly unstable. Hence, for
this problem at least, the condition $\lambda \le 1/4$, though clearly sufficient to guarantee
stability, is far from being a necessary condition. The actual stability condition for
the FTCS method is somewhere between $\lambda \le 2/5$ and $\lambda < 1/2$. Computing the
eigenvalues of $E_{FTCS}$ for $\Delta t$ ranging from 1/10 to 1/120 suggests that the condition
$\lambda \le 0.41$ is both necessary and sufficient to guarantee $\rho(E_{FTCS}) \le 1$.
Model Problem on an Annulus


Suppose the domain is specified to be an annulus, $r_{inner} \le r \le r_{outer}$ with $r_{inner}$
strictly greater than 0, rather than a disk. The model problem would then be

$$\frac{\partial u}{\partial t} = D\left(\frac{\partial^2 u}{\partial r^2} + \frac{1}{r}\frac{\partial u}{\partial r}\right), \quad r_{inner} \le r \le r_{outer},\ t > 0$$

$$u(r_{inner},t) = u_{inner}(t)$$

$$u(r_{outer},t) = u_{outer}(t)$$

$$u(r,0) = f(r).$$

The Neumann condition at the left “boundary” has been replaced by a Dirichlet
condition because there is no reason, in general, to expect that the solution will be
symmetric about $r = r_{inner}$. Since $r = 0$ is no longer in the domain, the problem
no longer has an artificial singularity. Equation (1) is therefore representative of
all of the semidiscrete equations.

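From equation (1), we obtain the following finite difference equation for the
FTCS method:

$$w_j^{(n+1)} = \left(\lambda - \frac{\mu}{2r_j}\right)w_{j-1}^{(n)} + (1 - 2\lambda)w_j^{(n)} + \left(\lambda + \frac{\mu}{2r_j}\right)w_{j+1}^{(n)}.$$

Here $r_j = r_{inner} + j\Delta r$, $\lambda = D\Delta t/(\Delta r)^2$ and $\mu = D\Delta t/\Delta r$. For the BTCS method,
the finite difference equation is

$$-\left(\lambda - \frac{\mu}{2r_j}\right)w_{j-1}^{(n+1)} + (1 + 2\lambda)w_j^{(n+1)} - \left(\lambda + \frac{\mu}{2r_j}\right)w_{j+1}^{(n+1)} = w_j^{(n)},$$

with the same expressions for $\lambda$ and $\mu$. Finally, the finite difference equation for
the Crank-Nicolson scheme is found to be

$$-\left(\lambda - \frac{\mu}{2r_j}\right)w_{j-1}^{(n+1)} + (1 + 2\lambda)w_j^{(n+1)} - \left(\lambda + \frac{\mu}{2r_j}\right)w_{j+1}^{(n+1)} = \left(\lambda - \frac{\mu}{2r_j}\right)w_{j-1}^{(n)} + (1 - 2\lambda)w_j^{(n)} + \left(\lambda + \frac{\mu}{2r_j}\right)w_{j+1}^{(n)},$$

where $\lambda = D\Delta t/[2(\Delta r)^2]$ and $\mu = D\Delta t/(2\Delta r)$.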

If we continue to use the Gerschgorin Circle Theorem to localize the eigenvalues
of the evolution matrices, it is straightforward to establish that the BTCS
method and the Crank-Nicolson scheme remain unconditionally stable. As for the
FTCS method, recall that the condition $\lambda \le 1/4$ was derived from the finite difference
equation associated with $r = 0$, which is now no longer present. The remaining
equations lead to the relaxed stability condition $\lambda \le 1/2$.
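The relaxed condition can be probed numerically for a particular annulus. The following sketch assumes NumPy and Dirichlet data at both ends, so only the interior values are unknowns; the function names and the sample parameters ($D = 1/10$, $0.1 \le r \le 1$, $\Delta r = 1/20$) are chosen for illustration.

```python
import numpy as np

def annulus_ftcs_matrix(D, dt, r_inner, r_outer, N):
    """FTCS evolution matrix on an annulus with Dirichlet data at both ends.
    The unknowns are the interior values w_1, ..., w_{N-1}."""
    dr = (r_outer - r_inner) / N
    lam = D * dt / dr**2
    mu = D * dt / dr
    n = N - 1
    E = np.zeros((n, n))
    for i in range(n):
        r_j = r_inner + (i + 1) * dr
        E[i, i] = 1.0 - 2.0 * lam
        if i > 0:                        # coefficient of w_{j-1}
            E[i, i - 1] = lam - mu / (2.0 * r_j)
        if i < n - 1:                    # coefficient of w_{j+1}
            E[i, i + 1] = lam + mu / (2.0 * r_j)
    return E

def spectral_radius(M):
    """Largest eigenvalue magnitude of M."""
    return float(np.max(np.abs(np.linalg.eigvals(M))))

if __name__ == "__main__":
    for dt in (1.0 / 80.0, 1.0 / 79.0):
        E = annulus_ftcs_matrix(0.1, dt, 0.1, 1.0, 18)
        print(f"dt = {dt:.6f}  spectral radius = {spectral_radius(E):.6f}")
```

With these parameters $\Delta t = 1/80$ corresponds to $\lambda = 1/2$, so the two time steps sit on either side of the stated bound.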
EXAMPLE 10.12 Demonstration of Stability for Model Problem on an
Annulus


Let’s approximate the solution of the initial boundary value problem

$$\frac{\partial u}{\partial t} = \frac{1}{10}\left(\frac{\partial^2 u}{\partial r^2} + \frac{1}{r}\frac{\partial u}{\partial r}\right), \quad 0.1 < r < 1,\ t > 0$$

$$u(0.1,t) = e^{-(k^2/10)t}J_0(0.1k)$$

$$u(1,t) = 0$$

$$u(r,0) = J_0(kr).$$

$J_0$ is the zeroth-order Bessel function of the first kind, and $k \approx 2.404825558$ is
the smallest positive root of $J_0$. The exact solution of this problem is
$u(r,t) = e^{-(k^2/10)t}J_0(kr)$.

$u(r,10)$ is plotted in the upper left panel of Figure 10.17. The other panels
display the approximate solutions computed using the FTCS method, the BTCS
method and the Crank-Nicolson scheme. For each numerical method, the spatial
discretization parameter was $\Delta r = 1/20$. The upper right panel contains two
approximate solutions computed using the FTCS method. With $\Delta r = 1/20$, the
stability condition $\lambda \le 1/2$ requires

$$\Delta t \le \frac{(1/20)^2}{2(1/10)} = \frac{1}{80}.$$

The solid curve corresponds to this value of $\Delta t$. The dashed curve was computed
with $\Delta t = 1/79$. This latter solution clearly exhibits the classical signs of instability.
Hence, when the problem is defined over an annulus, the Gerschgorin intervals lead
to a necessary and sufficient stability condition.

The approximate solutions obtained from the BTCS method and the Crank-Nicolson
scheme are displayed in the lower left and lower right panels of Figure 10.17,
respectively. Each of these solutions was computed with $\Delta t = 1/8$,
which corresponds to $\lambda = 5$. As expected, neither curve exhibits any signs of
instability.


Figure 10.17 Exact solution to the initial boundary value problem

$$\frac{\partial u}{\partial t} = \frac{1}{10}\left(\frac{\partial^2 u}{\partial r^2} + \frac{1}{r}\frac{\partial u}{\partial r}\right), \quad u(0.1,t) = e^{-(k^2/10)t}J_0(0.1k), \quad u(1,t) = 0, \quad u(r,0) = J_0(kr),$$

and approximate solutions computed using the FTCS method, the BTCS
method, and the Crank-Nicolson scheme. All solutions are displayed at
$t = 10$. Panel titles: Exact solution; FTCS method.


More General Problems


Source terms, decay terms, and non-Dirichlet boundary conditions can be incorporated
into the finite difference methods that have just been presented following
exactly the same procedures that were described in Sections 10.3 and 10.4. For
example, the BTCS method for the problem

$$\frac{\partial u}{\partial t} = D\left(\frac{\partial^2 u}{\partial r^2} + \frac{1}{r}\frac{\partial u}{\partial r}\right) - \beta(r,t)u + s(r,t), \quad 0 < r < R,\ t > 0$$

$$\frac{\partial u}{\partial r}(0,t) = 0$$

$$u(R,t) = u_R(t)$$

$$u(r,0) = f(r)$$

consists of the finite difference equations

$$\left(1 + 4\lambda + \Delta t\,\beta_0^{(n+1)}\right)w_0^{(n+1)} - 4\lambda w_1^{(n+1)} = w_0^{(n)} + \Delta t\,s_0^{(n+1)} \tag{6}$$

$$-\lambda\left(1 - \frac{1}{2j}\right)w_{j-1}^{(n+1)} + \left(1 + 2\lambda + \Delta t\,\beta_j^{(n+1)}\right)w_j^{(n+1)} - \lambda\left(1 + \frac{1}{2j}\right)w_{j+1}^{(n+1)} = w_j^{(n)} + \Delta t\,s_j^{(n+1)}, \tag{7}$$
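where the latter equation holds for $j = 1, 2, 3, \ldots, N-1$. As a second example,
consider the FTCS method for approximating the solution of

$$\frac{\partial u}{\partial t} = D\left(\frac{\partial^2 u}{\partial r^2} + \frac{1}{r}\frac{\partial u}{\partial r}\right) - \beta(r,t)u + s(r,t), \quad r_{inner} \le r \le r_{outer},\ t > 0$$

$$\frac{\partial u}{\partial r}(r_{inner},t) = \alpha(t)$$

$$p(t)u(r_{outer},t) + q(t)\frac{\partial u}{\partial r}(r_{outer},t) = \rho(t)$$

$$u(r,0) = f(r).$$

At $r = r_{inner}$ and $r = r_{outer}$, the finite difference equations are

$$w_0^{(n+1)} = \left(1 - 2\lambda - \Delta t\,\beta_0^{(n)}\right)w_0^{(n)} + 2\lambda w_1^{(n)} + \Delta t\,s_0^{(n)} - 2\Delta r\,\alpha^{(n)}\left(\lambda - \frac{\mu}{2r_0}\right) \tag{8}$$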


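and

$$w_N^{(n+1)} = 2\lambda w_{N-1}^{(n)} + \left[1 - 2\lambda - \Delta t\,\beta_N^{(n)} - \frac{2\Delta r\,p^{(n)}}{q^{(n)}}\left(\lambda + \frac{\mu}{2r_N}\right)\right]w_N^{(n)} + \Delta t\,s_N^{(n)} + \frac{2\Delta r\,\rho^{(n)}}{q^{(n)}}\left(\lambda + \frac{\mu}{2r_N}\right), \tag{9}$$

respectively. The equation at each of the interior grid points ($j = 1, 2, 3, \ldots, N-1$)
is

$$w_j^{(n+1)} = \left(\lambda - \frac{\mu}{2r_j}\right)w_{j-1}^{(n)} + \left(1 - 2\lambda - \Delta t\,\beta_j^{(n)}\right)w_j^{(n)} + \left(\lambda + \frac{\mu}{2r_j}\right)w_{j+1}^{(n)} + \Delta t\,s_j^{(n)}. \tag{10}$$
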
With regard to stability, the BTCS method and the Crank-Nicolson scheme 
remain unconditionally stable for all forms of the differential equation and for all 
combinations of boundary conditions. For the FTCS method, neither a source term 
nor a Neumann boundary condition produces any change in stability, but a decay 
term will reduce the maximum allowable time step. If the problem domain is an 
annulus, a Robin boundary condition will also reduce the maximum allowable time 
step. However, if the problem domain is a circle, a Robin boundary condition will 
produce no change in stability. 


Application Problem: Transient Temperature in a Heat Pack


In Section 8.2 (see page 681), we developed a model for the steady-state temperature
distribution in a heat pack. The corresponding model for the transient temperature
distribution, $T(r,t)$, is the initial boundary value problem

$$\frac{\partial T}{\partial t} = \frac{k}{\rho c_p}\,\frac{1}{r}\frac{\partial}{\partial r}\left(r\frac{\partial T}{\partial r}\right) - \frac{2h}{\rho c_p w}\left(T - T_\infty\right) + \frac{q}{\rho c_p}$$

$$\frac{\partial T}{\partial r}(0,t) = 0, \quad hT(R,t) + k\frac{\partial T}{\partial r}(R,t) = hT_\infty, \quad T(r,0) = T_\infty.$$

Here, $\rho = 1591$ kg/m³ is the mass density, $c_p = 237$ J/kg·K the specific heat,
and $k = 0.4$ W/m·K the thermal conductivity of the material inside the pack;
$h = 20$ W/m²·K is the convective heat transfer coefficient between the surface of
the heat pack and the surrounding air and $T_\infty = 20$°C is the temperature of the
air; $q$ = s00u00 W/m³ is the rate of heat generation per unit volume within the
pack; and $R = 10$ cm is the radius and $w = 6$ mm the thickness of the pack.

Figure 10.18 Transient temperature distribution in a heat pack.


The Crank-Nicolson scheme was used to compute the approximate temperature
distribution within the heat pack with $\Delta r = 1$ mm and $\Delta t = 0.2$ seconds.
Figure 10.18 displays the temperature at $t = 60$, $t = 120$, $t = 180$, and $t = 360$
seconds. The temperature distribution has essentially reached steady state after
six minutes.


EXERCISES 


In Exercises 1-3, numerically verify that

(a) the FTCS method is stable for $\lambda = 1/2$ but unstable for any $\lambda > 1/2$;
(b) the BTCS method is unconditionally stable; and
(c) the Crank-Nicolson method is unconditionally stable.

1.


du _18 (ou 
tr dr \ Or 
du 10 ( Ou 


) , ult) =0, u(3,t)=1, ulr,0) = (r-1)/2 


oe!) Mone bs 
ere ey, u(i,t) = 20(1-e"), Bp rt) = 9 u(r, 0) = 0 
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3. Be Te (G+ Ee , u{l/2 )=-31 2, u(l,t) = 0, u(r,0) =r? nr 


In Exercises 4-6,

(a) numerically verify that the FTCS method is stable for $\lambda = 1/4$ but unstable for
$\lambda = 1/2$;
(b) experimentally determine the largest value of $\lambda$ which will guarantee
$\rho(E_{FTCS}) \le 1$;
(c) numerically verify that the BTCS method is unconditionally stable; and
(d) numerically verify that the Crank-Nicolson method is unconditionally stable.


Gi Ne ( | ; Ot a =0, u(1,t)=1, u(r,0) =r? 


Ot 7 or \' Or Or 

du _ 106 ( Ou Ou 2 Ou =a = 1» 
5. St +o ( ase 3p rt) = 9, 3, (bot) = 1, ulr,0) =1- or 

du _ 10 Ou eee bu = ou = = 
6. 5 ae (#) +e°", Bp ot) = 9, u(1,t) +578) =0, u(r,0) =0 


In Exercises 7-9, numerically verify that 


(a) the FTCS method is second-order accurate in space; 
(b) the BTCS method is first-order accurate in time and second-order accurate in 
space; and 

(c) the Crank-Nicolson scheme is second-order accurate in both time and space by 
approximating the solution of the indicated initial boundary value problem. 

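7. $\dfrac{\partial u}{\partial t} = \dfrac{1}{10}\,\dfrac{1}{r}\dfrac{\partial}{\partial r}\left(r\dfrac{\partial u}{\partial r}\right)$, $\dfrac{\partial u}{\partial r}(0,t) = 0$, $u(1,t) = 0$, $u(r,0) = J_0(kr)$

exact solution: $u(r,t) = e^{-(k^2/10)t}J_0(kr)$, $k \approx 2.404825558$


8. $\dfrac{\partial u}{\partial t} = \dfrac{1}{r}\dfrac{\partial}{\partial r}\left(r\dfrac{\partial u}{\partial r}\right) - e^{-t}\left(r + \dfrac{1}{r}\right)$, $u(1/2,t) = \dfrac{1}{2}e^{-t} - \ln 2$, $u(1,t) = e^{-t}$,
$u(r,0) = r + \ln r$, exact solution: $u(r,t) = \ln r + re^{-t}$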
Ou 10 Ou 2,4 4,3 Ou Ou 4 
ee SES = ae ~0, 4 yp SE i 
ss ot or Or (-) cease Or aa erie BAe 


u(r,0)=1, exact solution :u(7,t) = 1+ tr*t 

10. (a) Derive equations (6) and (7). 
(b) Derive equations (8), (9), and (10). 

11. The transient temperature distribution within an annular fin in a heating system 
satisfies the initial boundary value problem 


OT 10 oT 2h 2h 
Tage ee a —T=— CO; 
aE 1S (rae Wes 


T (rinner: €) = F(t), AT (router, t) + KE (roster t) = hT eo, T(r, 0) = Too 
For each of the following materials, determine the temperature distribution in
the fin after 600 seconds. In each case, take $h = 50$ W/m²·K, $w = 5$ mm,
$T_\infty = 20$°C, $f(t) = 20 + 40(1 - e^{-t/129})$, $r_{inner} = 2$ cm, and $r_{outer} = 5$ cm.
(a) pure copper fin: $\rho = 8933$ kg/m³, $c_p = 385$ J/kg·K, $k = 401$ W/m·K
(b) stainless steel fin: $\rho = 8055$ kg/m³, $c_p = 480$ J/kg·K, $k = 15.1$ W/m·K
(c) commercial bronze fin: $\rho = 8800$ kg/m³, $c_p = 420$ J/kg·K, $k = 52$ W/m·K

12. Consider the following simplified model for insect dispersal. Let $n(r,t)$ denote
the insect population as a function of location, $r$, and time, $t$. Starting from
an initial population distribution $n_0(r)$, the insects disperse randomly with a
constant coefficient of diffusion, $D$. The mortality of the insects is proportional
to population size, and there exists a finite interval which can sustain insect life.
These assumptions lead to the initial boundary value problem

$$\frac{\partial n}{\partial t} = \frac{D}{r}\frac{\partial}{\partial r}\left(r\frac{\partial n}{\partial r}\right) - \lambda n, \quad n(r,0) = n_0(r), \quad \frac{\partial n}{\partial r}(0,t) = n(L,t) = 0$$

for $n(r,t)$. Assume that the insect population is measured in thousands. Take
$L = 1$ meter, $D = 0.05$ m²/day and $\lambda = 0.1$ (day)⁻¹. The initial population
distribution is given by


4 
no(r) = ie ae ue 
0, otherwise. 


Simulate the dispersal of the insect population over a four-day period. 


13. Water is pumped from a well, one meter in diameter, at a daily rate of $Q$ cubic
meters. The well water is drawn from a confined aquifer that extends for a
radius of 100 meters from the center of the well. The drawdown on the water
level within the aquifer, $h(r,t)$, satisfies the initial boundary value problem


oh (3 13) dh Q Oh 


a S Var? * rr Or aE Br\_ 


=0, h(r,0)=0, 
100 


r=1/2 

where $T$ is the transmissivity and $S$ the storativity of the aquifer.

(a) Suppose $T = 150$ m²/day, $S = 0.2$, and $Q = 100$ m³/day. Determine the
drawdown within the aquifer after 15 days of pumping.

(b) Suppose that during the pumping period, average daily rainfall is $w$ meters.
To account for precipitation, add the term $w/S$ to the right-hand side of the
differential equation for $h(r,t)$. Using the same parameter values listed in
part (a) and $w = 0.003$ m/day, determine the drawdown within the aquifer
after 15 days of pumping.

14. The finite difference equations for the FTCS method applied to the initial boundary
value problem

$$\frac{\partial u}{\partial t} = D\left(\frac{\partial^2 u}{\partial r^2} + \frac{1}{r}\frac{\partial u}{\partial r}\right) - \beta(r,t)u + s(r,t), \quad r_{inner} \le r \le r_{outer},\ t > 0$$

$$\frac{\partial u}{\partial r}(r_{inner},t) = \alpha(t)$$

$$p(t)u(r_{outer},t) + q(t)\frac{\partial u}{\partial r}(r_{outer},t) = \rho(t)$$

$$u(r,0) = f(r)$$

are presented in the text.
(a) Derive the stability condition associated with these equations.
(b) Numerically verify the stability condition found in part (a) for the problem

$$\frac{\partial u}{\partial t} = \left(\frac{\partial^2 u}{\partial r^2} + \frac{1}{r}\frac{\partial u}{\partial r}\right) - u, \quad 1/2 < r < 2,\ t > 0$$

$$\frac{\partial u}{\partial r}(1/2,t) = 2e^{-t}$$

$$u(2,t) + \frac{\partial u}{\partial r}(2,t) = e^{-t}\left(\frac{1}{2} + \ln 2\right)$$

$$u(r,0) = \ln r.$$

The exact solution for this problem is $u(r,t) = e^{-t}\ln r$.


10.6 PROBLEMS IN TWO SPATIAL DIMENSIONS 


To conclude our treatment of parabolic partial differential equations, we will
consider problems in two space dimensions. We will restrict attention to the rectangular
domain $R = \{(x,y) \mid a \le x \le b,\ c \le y \le d\}$, and our model problem will be the heat
equation

$$\frac{\partial u}{\partial t} = D\left(\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2}\right)$$

subject to Dirichlet boundary conditions. Source and decay terms will be included 
toward the end of the section. Non-Dirichlet boundary conditions will be considered 
in the exercises. 


The FTCS Method 


Let $N_x$ be the number of subintervals along the $x$-axis and $N_y$ be the number of
subintervals along the $y$-axis. Define the spacing parameters

$$\Delta x = \frac{b-a}{N_x} \quad \text{and} \quad \Delta y = \frac{d-c}{N_y}$$

and the gridlines $x_j = a + j\Delta x$ and $y_k = c + k\Delta y$. For simplicity, we will assume
throughout the section that $\Delta x = \Delta y = \Delta$. Next, evaluate the heat equation at
the arbitrary interior grid point $(x_j, y_k)$, and approximate each space derivative by
its second-order central difference formula. Finally, drop the truncation error terms
to produce the semidiscrete approximation

$$\frac{dv_{j,k}(t)}{dt} = D\left(\frac{v_{j-1,k}(t) - 2v_{j,k}(t) + v_{j+1,k}(t)}{\Delta^2} + \frac{v_{j,k-1}(t) - 2v_{j,k}(t) + v_{j,k+1}(t)}{\Delta^2}\right). \tag{1}$$

Here $v_{j,k}(t) \approx u(x_j, y_k, t)$.
To obtain the FTCS method, evaluate the semidiscrete template at time level
$t = t_n$ and then replace the time derivative with its first-order forward difference
approximation. Solving the resulting fully discrete equation for the value of the
approximate solution at time level $t = t_{n+1}$ yields

$$w_{j,k}^{(n+1)} = \lambda\left(w_{j-1,k}^{(n)} + w_{j+1,k}^{(n)} + w_{j,k-1}^{(n)} + w_{j,k+1}^{(n)}\right) + (1 - 4\lambda)w_{j,k}^{(n)}, \tag{2}$$

where $w_{j,k}^{(n)} \approx v_{j,k}(t_n) \approx u(x_j, y_k, t_n)$ and $\lambda = D\Delta t/\Delta^2$.

What is the maximum allowable value for $\lambda$ that will produce a stable
approximate solution? Let’s perform a von Neumann stability analysis. In two space
dimensions the discrete Fourier mode takes the form

$$w_{j,k}^{(n)} = r^n e^{i(j\theta_1 + k\theta_2)}.$$

Substituting this expression into equation (2) gives

$$r^{n+1}e^{i(j\theta_1 + k\theta_2)} = \left[\lambda\left(e^{-i\theta_1} + e^{i\theta_1} + e^{-i\theta_2} + e^{i\theta_2}\right) + 1 - 4\lambda\right]r^n e^{i(j\theta_1 + k\theta_2)}.$$

Dividing out $r^n e^{i(j\theta_1 + k\theta_2)}$ and using the identity $e^{i\theta} + e^{-i\theta} = 2\cos\theta$, we find

$$r = 2\lambda(\cos\theta_1 + \cos\theta_2) + 1 - 4\lambda = 1 - 2\lambda(2 - \cos\theta_1 - \cos\theta_2).$$

Since $0 \le 2 - \cos\theta_1 - \cos\theta_2 \le 4$ for all values of $\theta_1$ and $\theta_2$, it follows that
$1 - 8\lambda \le r \le 1$. To guarantee $|r| \le 1$, we must therefore choose $\lambda$ so that $1 - 8\lambda \ge -1$,
which requires $\lambda \le 1/4$. Hence, the stability condition for the FTCS method in two
space dimensions is even more restrictive than in one space dimension. Recalling
the definition of $\lambda$, the restriction on the time step is given by

$$\Delta t \le \frac{\Delta^2}{4D}.$$
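The bound on the amplification factor can be confirmed by sampling $r$ over a grid of angles. This is a small sketch assuming NumPy; the function names and the number of samples are arbitrary choices made for illustration.

```python
import numpy as np

def ftcs2d_amplification(lam, theta1, theta2):
    """Amplification factor r = 1 - 2*lam*(2 - cos(theta1) - cos(theta2))."""
    return 1.0 - 2.0 * lam * (2.0 - np.cos(theta1) - np.cos(theta2))

def max_amplification(lam, samples=201):
    """Largest |r| over a uniform grid of angles that includes -pi, 0 and pi."""
    theta = np.linspace(-np.pi, np.pi, samples)
    t1, t2 = np.meshgrid(theta, theta)
    return float(np.max(np.abs(ftcs2d_amplification(lam, t1, t2))))

if __name__ == "__main__":
    for lam in (0.20, 0.25, 25.0 / 98.0, 0.30):
        print(f"lambda = {lam:.4f}  max |r| = {max_amplification(lam):.4f}")
```

The extreme value $1 - 8\lambda$ is attained at $\theta_1 = \theta_2 = \pm\pi$, which is why the sampling grid is built to contain those angles.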

One programming note needs to be made before moving on to an example.
When equation (2) is applied with $j = 1$, $j = N_x - 1$, $k = 1$ or $k = N_y - 1$, at least one
of the $w$ values will be obtained from the Dirichlet boundary conditions. The best
approach for handling these cases is to dimension the $w$ matrix to $(N_x + 1) \times (N_y + 1)$,
rather than $(N_x - 1) \times (N_y - 1)$, and load the boundary condition values into
appropriate locations of the matrix. At the expense of a few memory locations, the
code for the algorithm will be simplified tremendously.
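One way to realize this storage scheme is sketched below, assuming NumPy and time-independent Dirichlet data. The names `ftcs2d_step` and `run_example` are hypothetical, and the test problem uses $D = 1/16$ with the initial condition $\sin(2\pi x)\sin(2\pi y)$ on the unit square.

```python
import numpy as np

def ftcs2d_step(w, lam):
    """Advance the (Nx+1) x (Ny+1) array w by one FTCS step, equation (2).
    Boundary entries are copied unchanged, so they must already hold the
    Dirichlet data (assumed here to be independent of time)."""
    w_new = w.copy()
    w_new[1:-1, 1:-1] = (
        lam * (w[:-2, 1:-1] + w[2:, 1:-1] + w[1:-1, :-2] + w[1:-1, 2:])
        + (1.0 - 4.0 * lam) * w[1:-1, 1:-1]
    )
    return w_new

def run_example(N=20, dt=1.0 / 100.0, t_end=0.1, D=1.0 / 16.0):
    """Integrate the test problem and return (numerical, exact) at t_end."""
    x = np.linspace(0.0, 1.0, N + 1)
    X, Y = np.meshgrid(x, x, indexing="ij")
    mode = np.sin(2.0 * np.pi * X) * np.sin(2.0 * np.pi * Y)
    lam = D * dt * N**2                  # D * dt / Delta^2 with Delta = 1/N
    w = mode.copy()
    for _ in range(int(round(t_end / dt))):
        w = ftcs2d_step(w, lam)
    exact = np.exp(-(np.pi**2 / 2.0) * t_end) * mode
    return w, exact

if __name__ == "__main__":
    w, exact = run_example()
    print("max abs error:", np.max(np.abs(w - exact)))
```

Because the boundary values live in the same array as the interior values, the update is a single slice expression with no special cases near the edges.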


EXAMPLE 10.13 Stability of the FTCS Method in Two Space Dimensions

Consider the initial boundary value problem

$$\frac{\partial u}{\partial t} = \frac{1}{16}\left(\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2}\right), \quad 0 \le x \le 1,\ 0 \le y \le 1,\ t > 0$$

$$u(0,y,t) = u(1,y,t) = 0$$

$$u(x,0,t) = u(x,1,t) = 0$$

$$u(x,y,0) = \sin(2\pi x)\sin(2\pi y).$$

The exact solution to this problem is $u(x,y,t) = e^{-(\pi^2/2)t}\sin(2\pi x)\sin(2\pi y)$. Two
approximations to $u(x,y,5)$, obtained using the FTCS method with different size
time steps, are displayed in Figure 10.19. Spatial resolution was set at $\Delta = \Delta x = \Delta y = 1/20$
for all calculations. With $\Delta = 1/20$, the maximum allowable time step,
based on stability considerations, is

$$\Delta t_{max} = \frac{\Delta^2}{4D} = \frac{(1/20)^2}{4(1/16)} = \frac{1}{100}.$$

The approximate solution in the top graph of Figure 10.19 was calculated
with a step size of $\Delta t_{max}$. The plotted surface exhibits no signs of instability. The
approximate solution in the bottom graph, which is clearly unstable, was computed
with a time step of $\Delta t = 1/98$. This time step corresponds to $\lambda \approx 0.255$.


Alternating Direction Implicit (ADI) Method 


The stability condition $\lambda \le 1/4$ generally makes the FTCS method too expensive
for practical use. In one space dimension, we found that the BTCS method and
the Crank-Nicolson scheme were unconditionally stable and required little additional
computational effort over the FTCS method. For the two-dimensional heat
equation, the BTCS method is given by the finite difference equation

$$-\lambda\left(w_{j-1,k}^{(n+1)} + w_{j+1,k}^{(n+1)} + w_{j,k-1}^{(n+1)} + w_{j,k+1}^{(n+1)}\right) + (1 + 4\lambda)w_{j,k}^{(n+1)} = w_{j,k}^{(n)}, \tag{3}$$

where $\lambda = D\Delta t/\Delta^2$, and the Crank-Nicolson scheme is given by

$$-\lambda\left(w_{j-1,k}^{(n+1)} + w_{j+1,k}^{(n+1)} + w_{j,k-1}^{(n+1)} + w_{j,k+1}^{(n+1)}\right) + (1 + 4\lambda)w_{j,k}^{(n+1)} = \lambda\left(w_{j-1,k}^{(n)} + w_{j+1,k}^{(n)} + w_{j,k-1}^{(n)} + w_{j,k+1}^{(n)}\right) + (1 - 4\lambda)w_{j,k}^{(n)}, \tag{4}$$

where $\lambda = D\Delta t/(2\Delta^2)$. Von Neumann stability analysis easily establishes that
both of these methods are unconditionally stable (see Exercise 1); however, the
coefficient matrix for the system of equations that must be solved during each time
step is no longer tridiagonal. Fortunately, equations (3) and (4) are trivial modifications
of the discrete Poisson problem we treated in Chapter 9. The multigrid
method can therefore be used to efficiently solve the corresponding systems. We
leave it to the interested reader to explore this issue in more detail.

The generally preferred approach for dealing with the two-dimensional heat
equation is to use one of a class of alternative differencing schemes that results in
tridiagonal coefficient matrices. The first such scheme was proposed by Peaceman
and Rachford [1]. Each time step of the scheme is carried out in two half-steps.
First, the approximate solution is advanced from time level $t = t_n$ to the intermediate
level $t = t_{n+1/2}$. The time derivative on the left-hand side of equation (1)
is replaced by its first-order forward difference formula. On the right-hand side of
equation (1), we evaluate the terms from one of the space derivatives at $t = t_n$ and
the terms from the other space derivative at $t = t_{n+1/2}$; that is, one derivative is
treated explicitly, the other implicitly. It does not matter which derivative is handled
in which way, for in advancing the solution from $t = t_{n+1/2}$ to $t = t_{n+1}$, the
treatment of the terms will be reversed. Thus, if we choose to treat the $x$ derivative
terms implicitly first, then the values at the intermediate time level are defined by

$$-\lambda\left(w_{j-1,k}^{(n+1/2)} + w_{j+1,k}^{(n+1/2)}\right) + (1 + 2\lambda)w_{j,k}^{(n+1/2)} = \lambda\left(w_{j,k-1}^{(n)} + w_{j,k+1}^{(n)}\right) + (1 - 2\lambda)w_{j,k}^{(n)}, \tag{5}$$

where $\lambda = D\Delta t/(2\Delta^2)$. The approximate solution at $t = t_{n+1}$ is ultimately computed
by handling the $y$ derivative terms implicitly:

$$-\lambda\left(w_{j,k-1}^{(n+1)} + w_{j,k+1}^{(n+1)}\right) + (1 + 2\lambda)w_{j,k}^{(n+1)} = \lambda\left(w_{j-1,k}^{(n+1/2)} + w_{j+1,k}^{(n+1/2)}\right) + (1 - 2\lambda)w_{j,k}^{(n+1/2)}. \tag{6}$$

Since this scheme alternately treats the $x$ derivative and then the $y$ derivative
implicitly, it is known as the alternating direction implicit, or ADI, scheme.

Figure 10.19 Two approximations to $u(x,y,5)$, where $u(x,y,t)$ is the
solution to the initial boundary value problem

$$\frac{\partial u}{\partial t} = \frac{1}{16}\left(\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2}\right), \quad u(0,y,t) = u(1,y,t) = 0, \quad u(x,0,t) = u(x,1,t) = 0, \quad u(x,y,0) = \sin(2\pi x)\sin(2\pi y).$$

Approximate solutions were computed with the FTCS method with different
step sizes.

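It is important to note that equation (5) actually represents a set of $N_y - 1$
disjoint tridiagonal systems of equations, one for the values along each row of the
computational grid. In matrix notation, each of these systems can be written as

$$E_x \mathbf{w}_{R(k)}^{(n+1/2)} = \lambda\left(\mathbf{w}_{R(k-1)}^{(n)} + \mathbf{w}_{R(k+1)}^{(n)}\right) + (1 - 2\lambda)\mathbf{w}_{R(k)}^{(n)} + \mathbf{b}_{R(k)}^{(n+1/2)},$$

where the subscript $R(k)$ denotes the $k$th row of the grid (for $k = 1, 2, 3, \ldots, N_y - 1$),
$E_x$ is the $(N_x - 1) \times (N_x - 1)$ tridiagonal evolution matrix

$$E_x = \begin{bmatrix}
1+2\lambda & -\lambda & & & \\
-\lambda & 1+2\lambda & -\lambda & & \\
 & \ddots & \ddots & \ddots & \\
 & & -\lambda & 1+2\lambda & -\lambda \\
 & & & -\lambda & 1+2\lambda
\end{bmatrix}$$

and

$$\mathbf{b}_{R(k)}^{(n+1/2)} = \begin{bmatrix} \lambda w_{0,k}^{(n+1/2)} & 0 & \cdots & 0 & \lambda w_{N_x,k}^{(n+1/2)} \end{bmatrix}^T.$$

The vector $\mathbf{b}_{R(k)}^{(n+1/2)}$ contains values along the $k$th row that are known from the
boundary conditions. In a similar manner, equation (6) actually represents a set of
$N_x - 1$ tridiagonal linear systems, one for the values along each column of the grid.
These systems can be written as

$$E_y \mathbf{w}_{C(j)}^{(n+1)} = \lambda\left(\mathbf{w}_{C(j-1)}^{(n+1/2)} + \mathbf{w}_{C(j+1)}^{(n+1/2)}\right) + (1 - 2\lambda)\mathbf{w}_{C(j)}^{(n+1/2)} + \mathbf{b}_{C(j)}^{(n+1)},$$

where the subscript $C(j)$ denotes the $j$th column of the grid (for $j = 1, 2, 3, \ldots, N_x - 1$),
the evolution matrix $E_y$ has the same tridiagonal structure as $E_x$ but is of
dimension $(N_y - 1) \times (N_y - 1)$, and

$$\mathbf{b}_{C(j)}^{(n+1)} = \begin{bmatrix} \lambda w_{j,0}^{(n+1)} & 0 & \cdots & 0 & \lambda w_{j,N_y}^{(n+1)} \end{bmatrix}^T.$$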

Taking into account the computation of the right-hand-side vectors and the 
solution of the tridiagonal systems, each full time step of the ADI scheme uses 
roughly 24 algebraic operations per grid point. This is four times as much work 
as the FTCS method, which requires 6 operations per point. Fortunately, the 
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ADI scheme is unconditionally stable, so fewer time steps can be performed. To 
establish this result, we substitute the two-dimensional discrete Fourier mode into 
equation (5) to obtain the partial amplification factor 


rae: 1 — 2A + 2A.cos 49 

= 1+ 2A —2Acos 6 | 
From equation (6), we obtain 

pe 1 — 2A + 2A cos 6, 

yO 442 — 2Ac08 62 


Therefore, the total amplification factor per time step is 


(1 — 2A 4+ 2Acos4)(1 ~— 2A + 2A cos 6;) 
(1 + 2A — 2Acos61)(1 + 2A ~- 24.cos 82)’ 


TS Pgly = 


which is always less than or equal to one in magnitude. Although we used a first- 
order difference formula for the time derivative, the overall truncation error of the 
ADI scheme is second-order in both time and space (see Morton and Mayers [2] for 
details). This is a second factor that allows for fewer time steps to be taken with 
the ADI scheme. 
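
The bound on the total amplification factor can also be checked numerically. The
following Python sketch evaluates the factor over a coarse grid of mode angles for
several values of $\lambda$ and records the largest magnitude found. The function
name `adi_amplification` and the sample values of $\lambda$ are illustrative
assumptions; a scan of this kind supports, but does not prove, the bound.

```python
import math

def adi_amplification(lam, theta1, theta2):
    """Total ADI amplification factor for one full time step."""
    num = (1 - 2*lam + 2*lam*math.cos(theta2)) * (1 - 2*lam + 2*lam*math.cos(theta1))
    den = (1 + 2*lam - 2*lam*math.cos(theta1)) * (1 + 2*lam - 2*lam*math.cos(theta2))
    return num / den

# Sample the mode angles on [0, 2*pi] (assumed resolution: pi/16).
angles = [i * math.pi / 16 for i in range(33)]

# Largest |r| found for each sample value of lambda.
worst = {
    lam: max(abs(adi_amplification(lam, t1, t2)) for t1 in angles for t2 in angles)
    for lam in (0.1, 0.5, 1.0, 19.53, 1000.0)
}

for lam, value in worst.items():
    print(f"lambda = {lam:8.2f}   max |r| = {value:.12f}")
```

Each factor has the form $(1 - a)/(1 + a)$ with $a = 2\lambda(1 - \cos\theta) \ge 0$,
which is why no sampled magnitude exceeds one, however large $\lambda$ becomes.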


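EXAMPLE 10.14 The ADI Scheme in Action

Consider the initial boundary value problem

$$\begin{aligned}
\frac{\partial u}{\partial t} &= \frac{1}{16}\left(\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2}\right), \quad 0 < x < 1,\ 0 < y < 1,\ t > 0 \\
u(0, y, t) &= u(1, y, t) = 0 \\
u(x, 0, t) &= u(x, 1, t) = 0 \\
u(x, y, 0) &= \sin(\pi x)\sin(\pi y).
\end{aligned}$$

Let’s approximate the solution to this problem at $t = 1$ using the ADI scheme. We
will take $\Delta = 1/4$ and $\Delta t = 1$. Then

$$\lambda = \frac{D\,\Delta t}{2\Delta^2} = \frac{(1/16)\cdot 1}{2(1/4)^2} = \frac{1}{2}.$$

The value of $\Delta$ implies that $N_x = N_y = 4$. Therefore, each half time step will
require the solution of three $3 \times 3$ tridiagonal systems of equations. Furthermore,
the evolution matrices, $E_x$ and $E_y$, will be identical:

$$E_x = E_y = \begin{bmatrix} 2 & -1/2 & 0 \\ -1/2 & 2 & -1/2 \\ 0 & -1/2 & 2 \end{bmatrix}.$$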


With $\lambda = 1/2$, $1 - 2\lambda = 0$ and one of the terms from the right-hand side of each
of the systems vanishes. Since the problem has homogeneous boundary conditions
specified around the entire boundary, all of the vectors $\mathbf{b}_{R(k)}$ and $\mathbf{b}_{C(j)}$ will
also be zero.
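From the initial condition, we find

$$\mathbf{w}_{R(1)}^{(0)} = \begin{bmatrix} 1/2 \\ \sqrt{2}/2 \\ 1/2 \end{bmatrix}, \quad
\mathbf{w}_{R(2)}^{(0)} = \begin{bmatrix} \sqrt{2}/2 \\ 1 \\ \sqrt{2}/2 \end{bmatrix}, \quad
\mathbf{w}_{R(3)}^{(0)} = \begin{bmatrix} 1/2 \\ \sqrt{2}/2 \\ 1/2 \end{bmatrix}.$$

Therefore, the first half time step requires the solution of the systems

$$\begin{bmatrix} 2 & -1/2 & 0 \\ -1/2 & 2 & -1/2 \\ 0 & -1/2 & 2 \end{bmatrix} \mathbf{w}_{R(1)}^{(1/2)}
= \frac{1}{2}\left( \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix} + \begin{bmatrix} \sqrt{2}/2 \\ 1 \\ \sqrt{2}/2 \end{bmatrix} \right)
= \begin{bmatrix} \sqrt{2}/4 \\ 1/2 \\ \sqrt{2}/4 \end{bmatrix},$$

$$\begin{bmatrix} 2 & -1/2 & 0 \\ -1/2 & 2 & -1/2 \\ 0 & -1/2 & 2 \end{bmatrix} \mathbf{w}_{R(2)}^{(1/2)}
= \frac{1}{2}\left( \begin{bmatrix} 1/2 \\ \sqrt{2}/2 \\ 1/2 \end{bmatrix} + \begin{bmatrix} 1/2 \\ \sqrt{2}/2 \\ 1/2 \end{bmatrix} \right)
= \begin{bmatrix} 1/2 \\ \sqrt{2}/2 \\ 1/2 \end{bmatrix},$$

$$\begin{bmatrix} 2 & -1/2 & 0 \\ -1/2 & 2 & -1/2 \\ 0 & -1/2 & 2 \end{bmatrix} \mathbf{w}_{R(3)}^{(1/2)}
= \frac{1}{2}\left( \begin{bmatrix} \sqrt{2}/2 \\ 1 \\ \sqrt{2}/2 \end{bmatrix} + \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix} \right)
= \begin{bmatrix} \sqrt{2}/4 \\ 1/2 \\ \sqrt{2}/4 \end{bmatrix}.$$
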
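The solutions of these systems are found to be

$$\mathbf{w}_{R(1)}^{(1/2)} = \begin{bmatrix} (1 + 2\sqrt{2})/14 \\ (4 + \sqrt{2})/14 \\ (1 + 2\sqrt{2})/14 \end{bmatrix}, \quad
\mathbf{w}_{R(2)}^{(1/2)} = \begin{bmatrix} (4 + \sqrt{2})/14 \\ (1 + 2\sqrt{2})/7 \\ (4 + \sqrt{2})/14 \end{bmatrix},$$

and

$$\mathbf{w}_{R(3)}^{(1/2)} = \begin{bmatrix} (1 + 2\sqrt{2})/14 \\ (4 + \sqrt{2})/14 \\ (1 + 2\sqrt{2})/14 \end{bmatrix}.$$

From these solutions, we construct

$$\mathbf{w}_{C(1)}^{(1/2)} = \begin{bmatrix} (1 + 2\sqrt{2})/14 \\ (4 + \sqrt{2})/14 \\ (1 + 2\sqrt{2})/14 \end{bmatrix}, \quad
\mathbf{w}_{C(2)}^{(1/2)} = \begin{bmatrix} (4 + \sqrt{2})/14 \\ (1 + 2\sqrt{2})/7 \\ (4 + \sqrt{2})/14 \end{bmatrix},$$

and

$$\mathbf{w}_{C(3)}^{(1/2)} = \begin{bmatrix} (1 + 2\sqrt{2})/14 \\ (4 + \sqrt{2})/14 \\ (1 + 2\sqrt{2})/14 \end{bmatrix},$$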
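and the systems

$$\begin{bmatrix} 2 & -1/2 & 0 \\ -1/2 & 2 & -1/2 \\ 0 & -1/2 & 2 \end{bmatrix} \mathbf{w}_{C(1)}^{(1)}
= \frac{1}{2}\left( \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix} + \begin{bmatrix} (4 + \sqrt{2})/14 \\ (1 + 2\sqrt{2})/7 \\ (4 + \sqrt{2})/14 \end{bmatrix} \right)
= \begin{bmatrix} (4 + \sqrt{2})/28 \\ (1 + 2\sqrt{2})/14 \\ (4 + \sqrt{2})/28 \end{bmatrix},$$

$$\begin{bmatrix} 2 & -1/2 & 0 \\ -1/2 & 2 & -1/2 \\ 0 & -1/2 & 2 \end{bmatrix} \mathbf{w}_{C(2)}^{(1)}
= \frac{1}{2}\left( \begin{bmatrix} (1 + 2\sqrt{2})/14 \\ (4 + \sqrt{2})/14 \\ (1 + 2\sqrt{2})/14 \end{bmatrix} + \begin{bmatrix} (1 + 2\sqrt{2})/14 \\ (4 + \sqrt{2})/14 \\ (1 + 2\sqrt{2})/14 \end{bmatrix} \right)
= \begin{bmatrix} (1 + 2\sqrt{2})/14 \\ (4 + \sqrt{2})/14 \\ (1 + 2\sqrt{2})/14 \end{bmatrix},$$

$$\begin{bmatrix} 2 & -1/2 & 0 \\ -1/2 & 2 & -1/2 \\ 0 & -1/2 & 2 \end{bmatrix} \mathbf{w}_{C(3)}^{(1)}
= \frac{1}{2}\left( \begin{bmatrix} (4 + \sqrt{2})/14 \\ (1 + 2\sqrt{2})/7 \\ (4 + \sqrt{2})/14 \end{bmatrix} + \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix} \right)
= \begin{bmatrix} (4 + \sqrt{2})/28 \\ (1 + 2\sqrt{2})/14 \\ (4 + \sqrt{2})/28 \end{bmatrix},$$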


which must be solved to complete the time step. The solutions of these systems are 


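$$\mathbf{w}_{C(1)}^{(1)} = \begin{bmatrix} (9 + 4\sqrt{2})/98 \\ (8 + 9\sqrt{2})/98 \\ (9 + 4\sqrt{2})/98 \end{bmatrix}, \quad
\mathbf{w}_{C(2)}^{(1)} = \begin{bmatrix} (8 + 9\sqrt{2})/98 \\ (9 + 4\sqrt{2})/49 \\ (8 + 9\sqrt{2})/98 \end{bmatrix},$$

and

$$\mathbf{w}_{C(3)}^{(1)} = \begin{bmatrix} (9 + 4\sqrt{2})/98 \\ (8 + 9\sqrt{2})/98 \\ (9 + 4\sqrt{2})/98 \end{bmatrix}.$$

The following table compares the approximate solution at $t = 1$ with the exact
solution, $u(x, y, 1) = e^{-\pi^2/8}\sin(\pi x)\sin(\pi y)$.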


x      y      $w^{(1)}$   Exact      Absolute Error
0.25   0.25   0.149560    0.145606   0.003953
0.25   0.50   0.211509    0.205919   0.005591
0.25   0.75   0.149560    0.145606   0.003953
0.50   0.25   0.211509    0.205919   0.005591
0.50   0.50   0.299119    0.291213   0.007907
0.50   0.75   0.211509    0.205919   0.005591
0.75   0.25   0.149560    0.145606   0.003953
0.75   0.50   0.211509    0.205919   0.005591
0.75   0.75   0.149560    0.145606   0.003953
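
The hand calculation above can be reproduced with a short program. The following
Python sketch performs one full ADI time step for this example, solving each
tridiagonal system with the Thomas algorithm. It assumes a square grid with zero
Dirichlet boundary values, so that the boundary vectors vanish; the function names
`solve_tridiagonal` and `adi_step` are illustrative, not taken from the text.

```python
import math

def solve_tridiagonal(lam, rhs):
    """Thomas algorithm for the evolution matrix:
    diagonal entries 1 + 2*lam, off-diagonal entries -lam."""
    m = len(rhs)
    diag = [1 + 2 * lam] * m
    d = list(rhs)
    for i in range(1, m):                      # forward elimination
        factor = -lam / diag[i - 1]
        diag[i] -= factor * (-lam)
        d[i] -= factor * d[i - 1]
    x = [0.0] * m
    x[-1] = d[-1] / diag[-1]
    for i in range(m - 2, -1, -1):             # back substitution
        x[i] = (d[i] + lam * x[i + 1]) / diag[i]
    return x

def adi_step(w, lam):
    """One full ADI time step; w[j][k] is the interior value at (x_{j+1}, y_{k+1}).
    Assumes zero boundary values on all four edges."""
    m = len(w)

    def get(a, j, k):                          # interior value, or 0 on the boundary
        return a[j][k] if 0 <= j < m and 0 <= k < m else 0.0

    # First half step: implicit in x, one tridiagonal solve per grid row k.
    half = [[0.0] * m for _ in range(m)]
    for k in range(m):
        rhs = [lam * (get(w, j, k - 1) + get(w, j, k + 1)) + (1 - 2 * lam) * w[j][k]
               for j in range(m)]
        for j, value in enumerate(solve_tridiagonal(lam, rhs)):
            half[j][k] = value

    # Second half step: implicit in y, one tridiagonal solve per grid column j.
    new = [[0.0] * m for _ in range(m)]
    for j in range(m):
        rhs = [lam * (get(half, j - 1, k) + get(half, j + 1, k)) + (1 - 2 * lam) * half[j][k]
               for k in range(m)]
        new[j] = solve_tridiagonal(lam, rhs)
    return new

n = 4                                          # N_x = N_y = 4, so Delta = 1/4
w0 = [[math.sin(math.pi * j / n) * math.sin(math.pi * k / n) for k in range(1, n)]
      for j in range(1, n)]
w1 = adi_step(w0, 0.5)
print(f"{w1[0][0]:.6f} {w1[0][1]:.6f} {w1[1][1]:.6f}")   # 0.149560 0.211509 0.299119
```

The three printed values correspond to the grid points (0.25, 0.25), (0.25, 0.50),
and (0.50, 0.50) in the table.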
EXAMPLE 10.15 Convergence and Stability Properties of the ADI Scheme

Let’s reconsider the initial boundary value problem

$$\begin{aligned}
\frac{\partial u}{\partial t} &= \frac{1}{16}\left(\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2}\right), \quad 0 < x < 1,\ 0 < y < 1,\ t > 0 \\
u(0, y, t) &= u(1, y, t) = 0 \\
u(x, 0, t) &= u(x, 1, t) = 0 \\
u(x, y, 0) &= \sin(\pi x)\sin(\pi y),
\end{aligned}$$

whose exact solution is $u(x, y, t) = e^{-(\pi^2/8)t}\sin(\pi x)\sin(\pi y)$. The following two
tables provide a numerical verification of the second-order accuracy of the ADI
scheme in both time and space. The first table summarizes the error in the
approximate solution at $t = 1$ for different values of $\Delta$. For all calculations a time step of
$\Delta t = 1/100$ was used. Each time $\Delta$ is cut in half, the error decreases by a factor of
approximately four. The second table presents the error in the approximate solution at $t = 1$ for
different values of $\Delta t$, with $\Delta$ fixed at 1/100. For each reduction in the step size
by a factor of two, the error drops by a factor of approximately four.

The unconditional stability of the ADI scheme is shown in Figure 10.20. Here,
the approximate solution at $t = 10$ is plotted. This solution was computed with
ten time steps; that is, the step size was taken to be $\Delta t = 1$. Spatial resolution was
set at $\Delta = 1/25$. Thus

$$\lambda = \frac{D\,\Delta t}{2\Delta^2} = \frac{(1/16)}{2(1/25)^2} = \frac{625}{32} \approx 19.53.$$

Even with a value of $\lambda$ this large, the approximate solution exhibits no signs of
instability.


Second-Order Accuracy in Space

$\Delta$   Maximum Absolute Error   Error Ratio   rms Error      Error Ratio
1/4        0.0186651851                           0.0093325926
1/8        0.0046286252             4.032555      0.0023143126   4.032555
1/16       0.0011539171             4.011229      0.0005769585   4.011229
1/32       0.0002874745             4.013980      0.0001437373   4.013980
1/64       0.0000710048             4.048662      0.0000355024   4.048662


Second-Order Accuracy in Time

$\Delta t$   Maximum Absolute Error   Error Ratio   rms Error      Error Ratio
1            0.0118110448                           0.0059055224
1/2          0.0028450761             4.151398      0.0014225381   4.151398
1/4          0.0006840195             4.159349      0.0003420098   4.159349
1/8          0.0001485288             4.605300      0.0000742644   4.605300
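
An error ratio near four under step halving corresponds to an observed order of
convergence near two, since $e(h)/e(h/2) \approx 2^p$ gives $p \approx \log_2$ of
the ratio. The following Python sketch recomputes the ratios and the observed
orders from the maximum absolute errors listed in the two tables; the helper name
`observed_orders` is an illustrative assumption.

```python
import math

# Maximum absolute errors copied from the two tables of Example 10.15.
space_errors = [0.0186651851, 0.0046286252, 0.0011539171, 0.0002874745, 0.0000710048]
time_errors = [0.0118110448, 0.0028450761, 0.0006840195, 0.0001485288]

def error_ratios(errors):
    """Ratio of each error to the error obtained after halving the step size."""
    return [coarse / fine for coarse, fine in zip(errors, errors[1:])]

def observed_orders(errors):
    """Estimate p from e(h) / e(h/2) = 2**p for each successive pair of errors."""
    return [math.log2(ratio) for ratio in error_ratios(errors)]

for label, errors in (("space", space_errors), ("time", time_errors)):
    orders = ", ".join(f"{p:.3f}" for p in observed_orders(errors))
    print(f"observed order in {label}: {orders}")
```

All of the observed orders are close to two; the last entry in time is somewhat
larger, in line with the ratio of 4.605300 in the table.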
Figure 10.20 Demonstration of unconditional stability of the ADI
scheme. Approximate solution was computed with $\lambda \approx 19.53$.


Source and Decay Terms in the ADI Scheme 
Let’s now discuss the inclusion of source and decay terms into the ADI scheme. In 
particular, we will consider the partial differential equation 


$$\frac{\partial u}{\partial t} = D\left(\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2}\right) - \beta(x, y, t)\,u + s(x, y, t).$$

As in previous sections, we will assume that $\beta(x, y, t) > 0$ for all $(x, y) \in R$, where
$R = \{(x, y) \mid a < x < b,\ c < y < d\}$, and for all $t > 0$. The spatial discretization
of the additional terms is straightforward: simply evaluate each term at the grid
point $(x_j, y_k)$. To maintain unconditional stability and second-order accuracy in
time, the time discretization of the source term and the decay term must be carried
out by averaging the values at the start and end of each half time step. Thus, to
the right-hand side of equation (5), we must add

$$-\frac{\Delta t}{4}\,\beta_{j,k}^{(n)} w_{j,k}^{(n)} + \frac{\Delta t}{4}\left[ s_{j,k}^{(n)} + s_{j,k}^{(n+1/2)} \right],$$

while to the left-hand side we add

$$\frac{\Delta t}{4}\,\beta_{j,k}^{(n+1/2)} w_{j,k}^{(n+1/2)}.$$

As for equation (6),

$$-\frac{\Delta t}{4}\,\beta_{j,k}^{(n+1/2)} w_{j,k}^{(n+1/2)} + \frac{\Delta t}{4}\left[ s_{j,k}^{(n+1/2)} + s_{j,k}^{(n+1)} \right]$$

is added to the right-hand side, and

$$\frac{\Delta t}{4}\,\beta_{j,k}^{(n+1)} w_{j,k}^{(n+1)}$$

is added to the left.
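
To see where the factor $\Delta t/4$ comes from, it may help to write out the
first half step in full. The following is a sketch under two assumptions: that
equation (5) is the pointwise form of the row systems given earlier, and that
$\lambda = D\,\Delta t/(2\Delta^2)$. Each half step has length $\Delta t/2$, and
averaging a term over the start and end of the half step contributes another
factor of $1/2$:

$$\begin{aligned}
\left(1 + 2\lambda + \frac{\Delta t}{4}\beta_{j,k}^{(n+1/2)}\right) w_{j,k}^{(n+1/2)}
 - \lambda\left(w_{j-1,k}^{(n+1/2)} + w_{j+1,k}^{(n+1/2)}\right)
 ={}& \lambda\left(w_{j,k-1}^{(n)} + w_{j,k+1}^{(n)}\right)
 + \left(1 - 2\lambda - \frac{\Delta t}{4}\beta_{j,k}^{(n)}\right) w_{j,k}^{(n)} \\
 & + \frac{\Delta t}{4}\left[ s_{j,k}^{(n)} + s_{j,k}^{(n+1/2)} \right].
\end{aligned}$$

When $\beta$ depends on position, the diagonal entries of the evolution matrix
vary from one grid point to the next, but the matrix remains tridiagonal.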

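EXAMPLE 10.16 Convergence and Stability of the ADI Scheme with
Source and Decay Terms

Let’s approximate the solution of the initial boundary value problem

$$\begin{aligned}
\frac{\partial u}{\partial t} &= \frac{1}{16}\left(\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2}\right) - \frac{t}{t^2 + 1}\,u + \frac{8x^2y^2 - t(x^2 + y^2)}{\sqrt{t^2 + 1}}, \\
& \qquad 0 < x < 1,\ 0 < y < 1,\ t > 0 \\
u(0, y, t) &= 0, \quad u(1, y, t) = 8ty^2/\sqrt{t^2 + 1} \\
u(x, 0, t) &= 0, \quad u(x, 1, t) = 8tx^2/\sqrt{t^2 + 1} \\
u(x, y, 0) &= \sin(\pi x)\sin(\pi y).
\end{aligned}$$

The source term for this problem is

$$s(x, y, t) = \frac{8x^2y^2 - t(x^2 + y^2)}{\sqrt{t^2 + 1}},$$

and the decay coefficient is

$$\beta(x, y, t) = \frac{t}{t^2 + 1}.$$

The exact solution is

$$u(x, y, t) = \frac{e^{-(\pi^2/8)t}\sin(\pi x)\sin(\pi y) + 8tx^2y^2}{\sqrt{t^2 + 1}}.$$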
The following two tables provide a numerical verification of the second-order
accuracy of the ADI scheme in both time and space. The first table summarizes
the error in the approximate solution at $t = 1$ for different values of $\Delta$. For all
calculations a time step of $\Delta t = 1/100$ was used. Each time $\Delta$ is cut in half, the
error decreases roughly by a factor of four. The second table presents the error in
the approximate solution at $t = 1$ for different values of $\Delta t$, with $\Delta$ fixed at 1/100.
For each reduction in the step size by a factor of two, the error drops by
roughly a factor of four.

Figure 10.21 Demonstration of unconditional stability of the ADI
scheme in the presence of source and decay terms. Approximate solution
was computed with $\lambda \approx 19.53$.


The unconditional stability of the ADI scheme is shown in Figure 10.21. Here,
the approximate solution at $t = 10$ is plotted. This solution was computed with
$\Delta t = 1$. Spatial resolution was set at $\Delta = 1/25$. Thus

$$\lambda = \frac{D\,\Delta t}{2\Delta^2} = \frac{(1/16)}{2(1/25)^2} = \frac{625}{32} \approx 19.53.$$

Even with a value of $\lambda$ this large, the approximate solution exhibits no signs of
instability.


Second-Order Accuracy in Space

$\Delta$   Maximum Absolute Error   Error Ratio   rms Error      Error Ratio
1/4        0.0131985079                           0.0065994016
1/8        0.0032730175             4.032520      0.0016367104   4.032113
1/16       0.0008159919             4.011090      0.0004082126   4.009456
1/32       0.0002033157             4.013423      0.0001018836   4.006657
1/64       0.0000502552             4.045661      0.0000253737   4.015320




Second-Order Accuracy in Time

$\Delta t$   Maximum Absolute Error   Error Ratio   rms Error      Error Ratio
1            0.1378577860                           0.0311540693
1/2          0.0096845702             14.234786     0.0031767904   9.806775
1/4          0.0026971364             3.590686      0.0007784414   4.080963
1/8          0.0006497332             4.151144      0.0001965845   3.959832
1/16         0.0001559938             4.165122      0.0000497160   3.954147


Application Problem: Two-Dimensional Model for Color Photograph Development


In Section 10.3, we considered a one-dimensional model for color photograph devel- 
opment. The corresponding two-dimensional model (see Friedman and Littman in 
Chapter 4 of Industrial Mathematics: A Course in Solving Real-World Problems, 
SIAM, Philadelphia, 1994) is 


$$\frac{\partial T}{\partial t} = D\left(\frac{\partial^2 T}{\partial x^2} + \frac{\partial^2 T}{\partial y^2}\right) - kTC + \gamma E(x, y), \qquad \frac{\partial C}{\partial t} = -kTC,$$

where $T(x, y, t)$ is the density of the oxidized developer and $C(x, y, t)$ is the density of
the coupler. The initial conditions for this system are $T(x, y, 0) = 0$ and $C(x, y, 0) =
C_0(x, y)$, while the boundary conditions are $T(0, y, t) = T(L, y, t) = T(x, 0, t) =
T(x, L, t) = 0$, where $L$ is the length of one side of the film. $E(x, y)$ is the exposure
function that indicates those regions of the film that were exposed to light.

We will simulate three minutes (180 seconds) of development time, taking as 
parameter values 


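$D = 100\ \mu\text{m}^2/\text{s}$, $k = 6.6 \times 10^{12}\ \mu\text{m}/\text{moles}\cdot\text{s}$, $\gamma = 7.5 \times 10^{-12}\ \text{moles}/\mu\text{m}\cdot\text{s}$,
$L = 1.5 \times 10^{3}\ \mu\text{m}$ and $C_0 = 1.125 \times 10^{-11}\ \text{moles}/\mu\text{m}$.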


The exposure pattern is shown in Figure 10.22. To help balance the drastic differ- 
ences in order of magnitude between the parameters, let’s introduce the following 
nondimensional variables: 


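$$\hat{T} = \frac{T}{\gamma L^2/D}, \quad \hat{C} = \frac{C}{C_0}, \quad \hat{t} = \frac{Dt}{L^2}, \quad \hat{x} = \frac{x}{L}, \quad \hat{y} = \frac{y}{L}.$$

In terms of these variables, the system of differential equations becomes

$$\frac{\partial \hat{T}}{\partial \hat{t}} = \frac{\partial^2 \hat{T}}{\partial \hat{x}^2} + \frac{\partial^2 \hat{T}}{\partial \hat{y}^2} - k_1 \hat{T}\hat{C} + E(\hat{x}, \hat{y}), \qquad \frac{\partial \hat{C}}{\partial \hat{t}} = -k_2 \hat{T}\hat{C},$$

where

$$k_1 = \frac{kC_0L^2}{D} = 1.670625 \times 10^{6} \quad\text{and}\quad k_2 = \frac{k\gamma L^4}{D^2} = 2.5059375 \times 10^{10}.$$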
Figure 10.22 Exposure pattern for two-dimensional color development
example (axes: distance in $\mu$m). Shaded areas represent an exposure level of 1; nonshaded
areas represent an exposure level of 0.


Furthermore, note that $t = 180$ seconds translates to $\hat{t} = 8 \times 10^{-3}$.
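
The nondimensional constants are easy to check by direct calculation. The
following Python sketch does so under stated assumptions: the scalings are taken
to be $k_1 = kC_0L^2/D$, $k_2 = k\gamma L^4/D^2$, and nondimensional time
$Dt/L^2$, and the parameter exponents below are the values that make these
formulas agree with the constants quoted in the text.

```python
# Parameter values for the application problem. The exponents are assumptions
# chosen to be consistent with the nondimensional constants quoted in the text.
D = 100.0          # diffusion coefficient, um^2/s
k = 6.6e12         # reaction rate constant
gamma = 7.5e-12    # coefficient of the exposure term
L = 1.5e3          # side length of the film, um
C0 = 1.125e-11     # initial coupler density

# Assumed nondimensional groupings.
k1 = k * C0 * L**2 / D
k2 = k * gamma * L**4 / D**2

# Three minutes of development time in nondimensional units.
t_final = D * 180.0 / L**2

print(f"k1 = {k1:.6e}")
print(f"k2 = {k2:.7e}")
print(f"nondimensional final time = {t_final:.3e}")
```

The very large values of both constants indicate that the reaction terms act on a
much shorter time scale than diffusion across the film.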

The oxidized developer equation is solved using the ADI scheme, while a semi- 
implicit discretization (with the coupler density treated implicitly and the oxidized 
developer density treated explicitly) is used for the coupler equation. The resulting 
finite difference equations for each half-step of the coupler equation are 


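$$\hat{C}_{j,k}^{(n+1/2)} = \frac{\hat{C}_{j,k}^{(n)}}{1 + \Delta t\,k_2\,\hat{T}_{j,k}^{(n)}/2} \quad\text{and}\quad \hat{C}_{j,k}^{(n+1)} = \frac{\hat{C}_{j,k}^{(n+1/2)}}{1 + \Delta t\,k_2\,\hat{T}_{j,k}^{(n+1/2)}/2}.$$

Density profiles computed in this manner, with $\Delta x = \Delta y = 1/100$ and $\Delta t$ =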
8 x 107%, are shown in Figure 10.23. 
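
The semi-implicit coupler update reduces to a pointwise division, so no linear
system has to be solved. The following Python sketch illustrates one half step. It
assumes the update comes from
$(C^{\text{new}} - C)/(\Delta t/2) = -k_2\,T\,C^{\text{new}}$, with the oxidized
developer density $T$ taken at the start of the half step; the function name and
the sample values, including the time step, are hypothetical.

```python
def coupler_half_step(c, t_dev, k2, dt):
    """Advance the coupler density by one half step, point by point.

    Implicit in the coupler density c, explicit in the developer density t_dev:
        (c_new - c) / (dt / 2) = -k2 * t_dev * c_new
    """
    return [[c_jk / (1.0 + dt * k2 * t_jk / 2.0) for c_jk, t_jk in zip(c_row, t_row)]
            for c_row, t_row in zip(c, t_dev)]

# Hypothetical 2 x 2 sample: coupler density 1 everywhere, and some developer present.
c0 = [[1.0, 1.0], [1.0, 1.0]]
t0 = [[0.0, 1.0e-3], [2.0e-3, 0.0]]
c_half = coupler_half_step(c0, t0, k2=2.5059375e10, dt=1.0e-6)
print(c_half)
```

Because the denominator is never smaller than one when $T \ge 0$, the computed
coupler density stays between zero and its previous value for any time step.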
Figure 10.23 Oxidized developer (top graph) and coupler density
(bottom graph) profiles after three minutes of development time.


EXERCISES 


1. (a) Show that the BTCS method for the two-dimensional heat equation is
unconditionally stable.
(b) Show that the Crank-Nicolson scheme for the two-dimensional heat equation
is unconditionally stable.


In Exercises 2-5, numerically verify that
(a) the FTCS method is stable for $\lambda = 1/4$ but unstable for any $\lambda > 1/4$;
(b) the ADI scheme is second-order accurate in both time and space; and
(c) the ADI scheme is unconditionally stable by approximating the solution of

$$\frac{\partial u}{\partial t} = \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2}$$

subject to the indicated initial and boundary conditions.


2. $u(0, y, t) = 1$, $u(1, y, t) = 0$, $u(x, 0, t) = u(x, 1, t) = 1 - x$,
u(z,y,0) =1—2— 4 sin(2rz) sin(2ry) 


exact solution: u(z,y,t) =1—a— Ayer t=" * sin(2mz) sin(27y) 


3. $u(0, y, t) = u(\pi, y, t) = u(x, 0, t) = u(x, \pi, t) = 0$, $u(x, y, 0) = \sin x \sin y$
exact solution: $u(x, y, t) = e^{-2t}\sin x \sin y$

4. $u(0, y, t) = u(\pi, y, t) = e^{-5t}\sin y$, $u(x, 0, t) = u(x, \pi, t) = 0$,
$u(x, y, 0) = \cos 2x \sin y$
exact solution: $u(x, y, t) = e^{-5t}\cos 2x \sin y$

5. $u(0, y, t) = 0$, $u(1, y, t) = y$, $u(x, 0, t) = 0$, $u(x, 1, t) = x$,
$u(x, y, 0) = xy - \sin \pi x \sin \pi y$
exact solution: $u(x, y, t) = xy - e^{-2\pi^2 t}\sin \pi x \sin \pi y$

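6. Consider the parabolic partial differential equation

$$\frac{\partial u}{\partial t} = D\left(\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2}\right) - \beta(x, y, t)\,u + s(x, y, t),$$

where $\beta(x, y, t) > 0$.


(a) Develop the FTCS method for this problem, subject to Dirichlet boundary
conditions along the entire boundary.


(b) Derive the stability condition for the FTCS method developed in part (a).


(c) Demonstrate the necessity of the stability condition derived in part (b) using
the initial boundary value problem

$$\begin{aligned}
\frac{\partial u}{\partial t} &= \frac{1}{16}\left(\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2}\right) - u + e^{-t}\left[8x^2y^2 - t(x^2 + y^2)\right], \\
& \qquad 0 < x < 1,\ 0 < y < 1,\ t > 0 \\
u(0, y, t) &= 0, \quad u(1, y, t) = 8te^{-t}y^2 \\
u(x, 0, t) &= 0, \quad u(x, 1, t) = 8te^{-t}x^2 \\
u(x, y, 0) &= \sin(\pi x)\sin(\pi y).
\end{aligned}$$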


7. Rework the “Two-Dimensional Model for Color Photograph Development”
application problem using the parameter values


D =550um?/s, b = 2.8 x 10}? am / moles -8,°y = 3.2 x 107% moles/zm-s 
L=15x 10°ym and Cp =83 x 1078 moles / jam. 


Take

$$E(x, y) = \begin{cases} 1, & \text{inside } S_1 \text{, but outside } S_2 \\ 0, & \text{elsewhere} \end{cases}$$

as the exposure function, where $S_1$ is a square of side length 0.25 centered at
(0.5, 0.5) and $S_2$ is a square of side length 0.1 centered at (0.45, 0.55). Simulate
five minutes of development time.

8. Consider flow through a square slot bounded by four flat plates. Each pair of
parallel plates is separated by a distance $h$. The fluid and all four walls are
initially at rest. At $t = 0$, flow is initiated by impulsively bringing the lower
wall, corresponding to $y = 0$, to the constant velocity $u_0$. The velocity profile
within the slot, $u(x, y, t)$, satisfies

$$\frac{\partial u}{\partial t} = \nu\left(\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2}\right)$$

$$u(x, y, 0) = 0, \quad u(x, 0, t) = u_0, \quad u(x, h, t) = 0, \quad u(0, y, t) = u(h, y, t) = 0,$$

where $\nu$ is the kinematic viscosity of the fluid. If we introduce the nondimensional
variables

$$U = u/u_0, \quad X = x/h, \quad Y = y/h, \quad\text{and}\quad T = t\nu/h^2,$$

the problem becomes

$$\frac{\partial U}{\partial T} = \frac{\partial^2 U}{\partial X^2} + \frac{\partial^2 U}{\partial Y^2}$$

$$U(X, Y, 0) = 0, \quad U(X, 0, T) = 1, \quad U(X, 1, T) = 0, \quad U(0, Y, T) = U(1, Y, T) = 0.$$

Plot $U(X, Y, T)$ for $T = 0.05$, $T = 0.1$, $T = 0.25$, and $T = 1$.


9. Suppose we need to solve the two-dimensional heat equation (with no source term 
and no decay term) over a rectangular domain with Dirichlet conditions specified 
along the bottom and left edges and Neumann conditions specified along the top 
and right edges. To be specific, suppose 


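$$\begin{aligned}
u(A, y, t) &= u_A(y, t) && \text{along the left edge} \\
u(x, C, t) &= u_C(x, t) && \text{along the bottom edge} \\
\frac{\partial u}{\partial x} &= \alpha(y, t) && \text{along the right edge, and} \\
\frac{\partial u}{\partial y} &= \beta(x, t) && \text{along the top edge.}
\end{aligned}$$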


(a) Develop the ADI scheme to solve this problem. 
(b) Using the scheme developed in part (a), solve 


$$\frac{\partial h}{\partial t} = 457.92\left(\frac{\partial^2 h}{\partial x^2} + \frac{\partial^2 h}{\partial y^2}\right)$$
h(O,y,t) =1- ee h(x, 0,¢) = 1— cos (=) 


Oh oh 
5p, (40014 #) = 5 -(ar, 500, 2) = 0 
A(a,y,0) = 0. 


This initial boundary value problem models the change in the water table 
of an aquifer that is bounded on two sides by impermeable surfaces. The
variables h, x and y are measured in meters, while t is measured in days. 
Determine the change in the water table after 15 days. 


10. Suppose we need to solve the two-dimensional heat equation (with a source term, 
but no decay term) over a rectangular domain with a Robin condition specified 
along the top edge and Neumann conditions specified along the other three edges. 
To be specific, suppose


6] 
a = aaly, t) along the left edge 
Ou 
rae B(x, t) along the bottom edge 
Ou , 
Bn OB (y, t) along the right edge, and 
6) 
p(x, t)u + g(z, iS = 7(z,t) along the top edge. 


(a) Develop the ADI scheme to solve this problem. 
(b) Using the scheme developed in part (a), solve


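$$\rho c_p \frac{\partial T}{\partial t} = k\left(\frac{\partial^2 T}{\partial x^2} + \frac{\partial^2 T}{\partial y^2}\right) + Q$$

$$\frac{\partial T}{\partial x}(0, y, t) = \frac{\partial T}{\partial x}(0.024, y, t) = \frac{\partial T}{\partial y}(x, 0, t) = 0$$

$$hT(x, 0.006, t) + k\frac{\partial T}{\partial y}(x, 0.006, t) = hT_\infty$$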
T(z, y,0) = _ 


This initial boundary value problem models heat conduction in a ceramic
plate with internal heat generation. Take the following values for the
parameters: $\rho = 2320$ kg/m$^3$, $c_p = 835$ J/kg$\cdot$K, $k = 2$ W/m$\cdot$K, $Q = 50$
W/m$^3$, $h = 100$ W/m$^2\cdot$K, and $T_\infty = 30^\circ$C. Determine the temperature
distribution after 60 seconds, 120 seconds, 180 seconds, and 240 seconds.


CHAPTER 11


Hyperbolic Equations and the 
Convection-Diffusion Equation 


AN OVERVIEW 
Fundamental Mathematical Problem 


The problems presented below, and throughout the remainder of the chapter, all
involve either the transport of some material property with a flow or the analysis 
of some wavelike phenomena. From these physical problems, three fundamental 
mathematical problems can be identified: 


The advection (or convection) equation 


$$\frac{\partial u}{\partial t} + a(x, t, u)\frac{\partial u}{\partial x} = g(x, t, u)$$


The convection-diffusion equation 


The wave equation 


$$\frac{\partial^2 u}{\partial t^2} = c^2 \frac{\partial^2 u}{\partial x^2}$$


The advection equation and the wave equation are the canonical examples of 
hyperbolic partial differential equations. Numerical methods for all three of these 
model problems will be developed in this chapter. 


Pollution Transported by Groundwater Flow 


Pollution seeps into the ground from a chemical plant and is then transported 
by groundwater flow. Once in the ground, the pollutant breaks down naturally, 
but very slowly. Let $C(x, t)$ denote the concentration (mass per unit volume) of
pollutant in the ground $x$ meters from the seepage site and $t$ days after the
initial contamination. To model the evolution of the concentration profile, consider
the control volume shown below, which has length $\Delta x$ and cross-sectional area $A$
perpendicular to the direction of groundwater flow.

[Figure: control volume, labeled with the flux $CAv_g$ entering, the flux leaving,
and natural decay.]

The total mass of pollutant within the control volume is $CA\,\Delta x$. Due to
groundwater flow, pollutant enters and exits the control volume at the rates

$$CAv_g \quad\text{and}\quad CAv_g + \frac{\partial (CAv_g)}{\partial x}\,\Delta x,$$

respectively, where $v_g$ is the velocity of the groundwater. The rate of natural
breakdown is proportional to the amount of pollutant present, with constant of
proportionality $\alpha$. Applying the principle of conservation of mass to the control
volume, it follows that

$$\frac{\partial (CA\,\Delta x)}{\partial t} = CAv_g - \left[ CAv_g + \frac{\partial (CAv_g)}{\partial x}\,\Delta x \right] - \alpha CA\,\Delta x.$$

If $A$ and $v_g$ are constant, this simplifies to

$$\frac{\partial C}{\partial t} + v_g \frac{\partial C}{\partial x} = -\alpha C. \qquad (1)$$

We will assume that the ground is initially pollutant free and that the concentration
at the seepage site is constant; that is,

$$C(x, 0) = 0 \quad\text{and}\quad C(0, t) = C_0 \qquad (2)$$

for some constant $C_0$. If we now divide (1) and (2) by $C_0$ and define the normalized
pollutant concentration

$$c = C/C_0,$$

we arrive at our final model: the initial boundary value problem

$$\frac{\partial c}{\partial t} + v_g \frac{\partial c}{\partial x} = -\alpha c, \quad c(x, 0) = 0, \quad c(0, t) = 1. \qquad (3)$$
Traffic Flow


Traffic on a one-lane highway through the mountains is stuck behind a slow-moving
truck. Eventually, the truck turns off the highway, and the vehicles which were
behind the truck begin to spread out. We would like to model the evolution of the
traffic pattern, which is described by the vehicle density, $\rho(x, t)$. Note that $\rho$ is
measured in cars/mile.

Let $J(x, t)$ denote the rate at which vehicles pass the arbitrary location $x$ at
time $t$. The function $J$ is called the flux density or the flow rate. Requiring that the
time rate of change of the number of vehicles along a representative section of the
highway be equal to the net flux of vehicles into that section leads to the relation

$$\frac{\partial \rho}{\partial t} = -\frac{\partial J}{\partial x}. \qquad (4)$$


To proceed further, we must specify a form for the flux density.
The simplest form we can choose for the flux density is

$$J = v(x, t)\rho(x, t),$$

where $v$ is the traffic velocity. Using this expression for $J$ implies that drivers react
only to local conditions; however, drivers generally pay attention to the traffic ahead
of them. All other things being equal, it therefore seems reasonable to expect that
the flux density will be lower when travelling into a region of higher density and
higher when travelling into a region of lower density. Accordingly, we include the
term $-D\,\partial\rho/\partial x$ in our flux density function, where $D$ is some positive constant. Thus,
we take

$$J = v\rho - D\frac{\partial \rho}{\partial x}. \qquad (5)$$

Substituting (5) into (4) yields

$$\frac{\partial \rho}{\partial t} + \frac{\partial (v\rho)}{\partial x} = D\frac{\partial^2 \rho}{\partial x^2}.$$

Empirical evidence suggests that traffic velocity is a function of vehicle density.
Consequently, we will assume that $v = v(\rho)$ and define the function $f(\rho) = v(\rho)\rho$.
Our final model for the vehicle density is then the convection-diffusion equation

$$\frac{\partial \rho}{\partial t} + \frac{\partial f(\rho)}{\partial x} = D\frac{\partial^2 \rho}{\partial x^2}. \qquad (6)$$


The Transmission Line Equations 


A simple transmission line consists of a cable stretched over an extended distance.
When grounded, the cable acts much like an electric circuit with inductance per
unit length $l$ and capacitance per unit length $c$. We will also assume that the cable
has resistance per unit length $r$ and experiences current leakage proportional to
the voltage, with constant of proportionality (per unit length) $g$. Our objective
is to model the voltage, $V(x, t)$, and the current, $I(x, t)$, along the length of the
transmission line.

Consider a representative section of the cable of length $\Delta x$. The resistance of
the cable produces a voltage drop along this section of $Ir\,\Delta x$, while the inductance
of the cable produces a voltage drop of $l\,\Delta x\,\partial I/\partial t$. Therefore,

$$V(x + \Delta x) = V(x) - Ir\,\Delta x - l\,\Delta x\,\frac{\partial I}{\partial t}.$$

Bringing all of the terms to the left side of the equation, dividing through by $\Delta x$
and taking the limit as $\Delta x \to 0$, we obtain

$$l\frac{\partial I}{\partial t} + \frac{\partial V}{\partial x} + rI = 0. \qquad (7)$$

In terms of the current flowing along this section of the cable, leakage produces
a drop of $g\,\Delta x\,V$. The capacitance of the cable produces an additional drop of
$c\,\Delta x\,\partial V/\partial t$. Thus,

$$I(x + \Delta x) = I(x) - g\,\Delta x\,V - c\,\Delta x\,\frac{\partial V}{\partial t},$$

which leads to

$$c\frac{\partial V}{\partial t} + \frac{\partial I}{\partial x} + gV = 0. \qquad (8)$$

Equations (7) and (8) form a system of advection equations for determining $V(x, t)$
and $I(x, t)$.

If the resistance of the cable and current leakage are negligible, then the
equations for the current and voltage can be decoupled as follows. First, set $r =
g = 0$. Next, differentiate (7) with respect to $t$, multiply the resulting equation by
$c$ and subtract the result of differentiating (8) with respect to $x$. These steps yield

$$lc\frac{\partial^2 I}{\partial t^2} - \frac{\partial^2 I}{\partial x^2} = 0. \qquad (9)$$

On the other hand, differentiating (8) with respect to $t$, multiplying the resulting
equation by $l$ and subtracting the result of differentiating (7) with respect to $x$
yields

$$lc\frac{\partial^2 V}{\partial t^2} - \frac{\partial^2 V}{\partial x^2} = 0. \qquad (10)$$

Thus, when resistance and leakage can be neglected, current and voltage along the
transmission line satisfy the separate wave equations (9) and (10).


Characteristics 


In keeping with previous chapters, we will work exclusively with finite difference 
methods. No treatment of hyperbolic equations is complete, however, without at 
least some discussion of characteristics. Consider the advection equation 


$$\frac{\partial u}{\partial t} + a(x, t, u)\frac{\partial u}{\partial x} = g(x, t, u).$$

Suppose we can find a family of curves in the $x$-$t$ plane, $x(t)$, for which

$$\frac{dx}{dt} = a(x, t, u).$$

The curves defined by this ordinary differential equation are called the
characteristics of the partial differential equation. Along any characteristic, the solution to
the partial differential equation, $u(x, t)$, satisfies

$$\frac{du}{dt} = \frac{\partial u}{\partial t} + \frac{dx}{dt}\frac{\partial u}{\partial x} = \frac{\partial u}{\partial t} + a(x, t, u)\frac{\partial u}{\partial x} = g(x, t, u).$$

If the value of the solution is specified along any curve, $C$, which is not a
characteristic, then the value of the solution away from $C$ can be approximated by solving
the system of initial value problems

$$\begin{aligned}
\frac{dx}{dt} &= a(x, t, u), & x(t_0) &= x_0 \\
\frac{du}{dt} &= g(x, t, u), & u(x_0, t_0) &= u_0
\end{aligned}$$

for a collection of points $(x_0, t_0) \in C$. This technique is known as the method of
characteristics.
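
The method of characteristics can be sketched in a few lines of code. The
following Python example integrates the pair of initial value problems
$dx/dt = a(x, t, u)$, $du/dt = g(x, t, u)$ along a single characteristic using the
classical fourth-order Runge-Kutta method. The integrator choice, the function
name `trace_characteristic`, and both sample problems are illustrative
assumptions rather than a prescribed implementation.

```python
import math

def trace_characteristic(a, g, x0, t0, u0, t_end, steps=200):
    """Follow one characteristic from (x0, t0) to t = t_end with classical RK4.

    Integrates dx/dt = a(x, t, u) and du/dt = g(x, t, u) simultaneously and
    returns the final position and solution value.
    """
    h = (t_end - t0) / steps
    x, t, u = x0, t0, u0
    for _ in range(steps):
        k1x, k1u = a(x, t, u), g(x, t, u)
        k2x, k2u = (f(x + h / 2 * k1x, t + h / 2, u + h / 2 * k1u) for f in (a, g))
        k3x, k3u = (f(x + h / 2 * k2x, t + h / 2, u + h / 2 * k2u) for f in (a, g))
        k4x, k4u = (f(x + h * k3x, t + h, u + h * k3u) for f in (a, g))
        x += h / 6 * (k1x + 2 * k2x + 2 * k3x + k4x)
        u += h / 6 * (k1u + 2 * k2u + 2 * k3u + k4u)
        t += h
    return x, u

# Sample 1: a = 2t, g = 0. The characteristic through x0 at t = 0 is x = x0 + t**2,
# and u is constant along it.
x_end, u_end = trace_characteristic(lambda x, t, u: 2 * t, lambda x, t, u: 0.0,
                                    x0=-0.5, t0=0.0, u0=1.0, t_end=3.0)

# Sample 2: constant velocity 1 with decay, g = -0.5 u, so u decays like exp(-0.5 t).
x_dec, u_dec = trace_characteristic(lambda x, t, u: 1.0, lambda x, t, u: -0.5 * u,
                                    x0=0.0, t0=0.0, u0=1.0, t_end=2.0)
print(x_end, u_end, x_dec, u_dec)
```

Repeating the calculation for a collection of starting points on the initial curve
builds up an approximation to the solution over a region of the $x$-$t$ plane.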

For a more detailed discussion of characteristics, in particular their use in 
determining analytical solutions to hyperbolic partial differential equations, con- 
sult any standard partial differential equations text. Morton and Mayers [1] and 
Smith [2] present a more detailed development of the method of characteristics 
as a numerical technique, with worked examples contained in the latter reference. 
Gerald and Wheatley [8] discuss the use of characteristics for the wave equation. 


Remainder of the Chapter 


The material in this chapter is organized as follows. In Section 11.1, the upwind 
finite difference method for the advection equation is presented. A detailed analysis 
of the stability and the error of this method is presented. A second scheme for 
approximating the solution of the advection equation, the MacCormack method, 
is presented in Section 11.2. Techniques for the convection-diffusion equation are 
treated in Section 11.3. The final section addresses the wave equation. 
11.1 ADVECTION EQUATION I: UPWIND DIFFERENCING


The advection equation arises naturally as a model for conservation laws. Suppose a 
fluid is flowing in the $x$-direction. Let $u(x, t)$ denote the time-dependent distribution
of some physical property of the fluid (such as density, temperature or momentum)
or the concentration of a substance dissolved in the fluid. Consider an infinitesimal
control volume. Balancing the flows into and out from the volume, as well as any
sources and sinks, with the time rate of change of $u(x, t)$ within the volume leads
to

$$\frac{\partial u}{\partial t} + \frac{\partial (au)}{\partial x} = g(x, t, u). \qquad (1)$$

Here, $a(x, t, u)$ is the velocity of the flow, and $g(x, t, u)$ represents the source and
sink information. An alternative form for a conservation law is

$$\frac{\partial u}{\partial t} + \frac{\partial f(u)}{\partial x} = g(x, t, u), \qquad (2)$$

where $f(u)$ does not depend explicitly on either of the independent variables and
represents the flux of the quantity $u$.

The objective of this section is to develop the upwind finite difference method 
for the advection equation. We will start with a form that is simpler than either of 
equations (1) or (2) and then generalize. A detailed analysis of the stability and the 
error of the upwind scheme will also be presented. Toward the end of the section, 
the application of the upwind scheme to systems of advection equations will be 
briefly discussed. A second finite difference method for the advection equation will 
be presented in the next section. 


The Basic Method


Consider the advection equation

$$\frac{\partial u}{\partial t} + a(x, t, u)\frac{\partial u}{\partial x} = g(x, t, u). \qquad (3)$$

Let $\Delta x$ denote the uniform spacing in the $x$-direction, and suppose that $t_{n+1} =
t_n + \Delta t$. To approximate the solution at time level $t = t_{n+1}$ knowing the values
at time level $t = t_n$, first evaluate equation (3) at $(x_j, t_n)$. Next, replace the time
derivative with its first-order forward difference approximation. To handle the space
derivative, we will take our cue from the characteristics of the partial differential
equation. If $a_j^{(n)}$ is positive, then the characteristic through $(x_j, t_n)$ travels from
left to right. This implies that information about the solution is transported from
left to right and suggests the use of a backward difference formula for $\partial u/\partial x$. The
resulting finite difference equation is

$$w_j^{(n+1)} - w_j^{(n)} + \lambda\left(w_j^{(n)} - w_{j-1}^{(n)}\right) = \Delta t\,g_j^{(n)}, \qquad (4)$$

where $\lambda = a_j^{(n)}\Delta t/\Delta x$. By a similar thought process, if $a_j^{(n)}$ is negative, then the
characteristic through $(x_j, t_n)$ travels from right to left and suggests the use of a
forward difference formula for $\partial u/\partial x$. The corresponding finite difference equation is

$$w_j^{(n+1)} - w_j^{(n)} + \lambda\left(w_{j+1}^{(n)} - w_j^{(n)}\right) = \Delta t\,g_j^{(n)}. \qquad (5)$$

Combining equations (4) and (5) yields the upwind finite difference method:

$$w_j^{(n+1)} = \begin{cases}
(1 - \lambda)\,w_j^{(n)} + \lambda w_{j-1}^{(n)} + \Delta t\,g_j^{(n)}, & a_j^{(n)} > 0 \\
(1 + \lambda)\,w_j^{(n)} - \lambda w_{j+1}^{(n)} + \Delta t\,g_j^{(n)}, & a_j^{(n)} < 0.
\end{cases} \qquad (6)$$


EXAMPLE 11.1 A Simple Advection Equation


Consider the advection equation

$$\frac{\partial u}{\partial t} + 2t\frac{\partial u}{\partial x} = 0, \quad -\infty < x < \infty,\ t > 0$$

subject to the initial condition

$$u(x, 0) = u_0(x) = \begin{cases} 1, & -1 < x < 0 \\ 0, & \text{elsewhere.} \end{cases}$$

Comparing the current problem with equation (3), we see that $a(x, t, u) = 2t$ and
$g(x, t, u) = 0$. Since the domain consists of $t > 0$, it follows that $a_j^{(n)}$ is always
positive and the characteristics always travel from left to right. Hence, the upwind
difference method will always use a backward difference for the space derivative.

Since we cannot compute the approximate solution over the entire real line, we 
will have to truncate the domain to the closed interval [A, B] for some constants A 
and B. For the left endpoint of the domain, note that the initial data is zero for all 
$x < -1$ and, as stated earlier, the characteristics all travel to the right. Hence, it
would be appropriate to choose any A < —1 and introduce the numerical boundary 
condition u(A,t) = 0. All calculations below were performed with A = —2. The 
appropriate selection of B depends upon how far forward in time the solution is 
desired—we must make sure that the solution does not interact with the artificial 
boundary. For this demonstration problem, we will advance the solution to t = 3, 
and B = 11 is found to be sufficiently large to be beyond the leading edge of the 
solution. No numerical boundary condition is needed at this point because we 
always use a backward difference formula. 

Figure 11.1 displays the approximate solution at times $t = 1$, $t = 2$, and
$t = 3$ (the solid curves) computed using the upwind difference scheme. The top
graph was computed with $\Delta x = 1/20$ and $\Delta t = 1/200$, while the bottom graph
was computed with $\Delta x = 1/40$ and $\Delta t = 1/400$. The exact solution of the partial
differential equation, $u(x, t) = u_0(x - t^2)$, is indicated by the dotted curves in each
graph. The upwind scheme propagates the initial condition to the right at roughly
the correct speed: the center of mass of the true solution is at 0.5, 3.5, and 8.5
at times $t = 1$, $t = 2$, and $t = 3$, respectively, while the center of mass for the
corresponding approximate solutions is at 0.495, 3.49, and 8.48497. However, the
upwind scheme also rounds over the corners of the original square wave, continually
reduces the amplitude and spreads out the support of the pulse.

Figure 11.1 Solution of an advection equation at times $t = 1$, $t = 2$,
and $t = 3$. Approximate solutions, computed with the upwind difference
method, are indicated by the solid curves, and the exact solutions are
indicated by the dotted curves. (Top panel: $\Delta x = 1/20$, $\Delta t = 1/200$.)

Increasing the resolution (that is, reducing the step sizes) improves the accu- 
racy of the approximation to the extent one would expect from a first-order method. 
The amplitude errors have roughly been cut in half, as have the errors in the loca- 
tion of the center of mass. In particular, the centers of the approximate solutions 
in the bottom graph, from left to right, are located at 0.4975, 3.495, and 8.4925. 
There is also less spreading of the solution with the smaller step sizes.
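
The centers of mass quoted in this example can be checked with a short program.
The following Python sketch applies the upwind scheme, with a backward difference
throughout because the velocity $2t$ is never negative, and advances the square
wave to $t = 1$ on both grids. The function names, the strict inequalities used to
sample the initial condition, and the decision to stop at $t = 1$ are illustrative
assumptions.

```python
def upwind_square_wave(nx_per_unit, nt_per_unit, t_end, left=-2, right=11):
    """Upwind scheme for u_t + 2t u_x = 0 on [left, right] with u(left, t) = 0.

    nx_per_unit and nt_per_unit are the reciprocals of dx and dt.
    """
    dx, dt = 1.0 / nx_per_unit, 1.0 / nt_per_unit
    x = [left + j / nx_per_unit for j in range((right - left) * nx_per_unit + 1)]
    w = [1.0 if -1 < xj < 0 else 0.0 for xj in x]      # square-wave initial data
    for n in range(round(t_end * nt_per_unit)):
        lam = 2 * (n * dt) * dt / dx                   # velocity a = 2t evaluated at t_n
        # Backward difference at every interior point; the left boundary value stays 0.
        w = [0.0] + [(1 - lam) * w[j] + lam * w[j - 1] for j in range(1, len(w))]
    return x, w

def center_of_mass(x, w):
    """Discrete center of mass of the profile w on the grid x."""
    return sum(xj * wj for xj, wj in zip(x, w)) / sum(w)

x_coarse, w_coarse = upwind_square_wave(20, 200, 1.0)
x_fine, w_fine = upwind_square_wave(40, 400, 1.0)
print(center_of_mass(x_coarse, w_coarse), center_of_mass(x_fine, w_fine))
```

The two centers of mass come out close to 0.495 and 0.4975, the values reported
above for $t = 1$, while the peak amplitude never exceeds its initial value of one.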


Analysis of Upwind Differencing 


Let’s start with a stability analysis. In 1928, Courant, Friedrichs, and Lewy [1]
developed a necessary condition for the stability of any difference scheme for the
advection equation. The condition, which has become known as the CFL
condition, states that the domain of dependence of the partial differential equation, as
determined by the characteristic that passes through the point $(x_j, t_{n+1})$, must be
contained in the domain of dependence of the finite difference equation used to
compute the value at $(x_j, t_{n+1})$. For a computational grid with spacings of $\Delta t$ in time
and $\Delta x$ in space, and assuming $a$ is constant, the characteristic through $(x_j, t_{n+1})$
intersects the line $t = t_n$ at $x = x_j - a\,\Delta t$. For this to remain inside the domain of
dependence of the finite difference equation of the upwind scheme, we must have
$|a\,\Delta t| \le \Delta x$, or $|\lambda| \le 1$. This condition has an interesting physical interpretation:
it dictates that once $\Delta x$ has been selected, $\Delta t$ must be chosen small enough so that
material is not transported more than one grid space in a single time step.

To determine whether the condition $|\lambda| \le 1$ is sufficient for the upwind scheme
to be stable, we perform a von Neumann stability analysis. Here, we will take
the discrete Fourier mode to be of the form $w_j^{(n)} = r^n e^{ij(k\,\Delta x)}$, where $k$ is the
wavenumber. Representing the frequency of the mode in terms of the wavenumber
and the grid spacing will simplify the error analysis that is presented below. Let's
start with the case of $a_j^{(n)}$ being positive. For the time being, we will assume that
$g$ does not depend on $u$. This implies that the term $\Delta t\, g_j^{(n)}$ in equation (4) will
have no effect on stability, so it can be set to zero. When, in the worked examples, $g$
does depend on $u$, we will reexamine the stability condition on a case-by-case basis.


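Substituting the Fourier mode into equation (4) produces

$$r - 1 + \lambda\left(1 - e^{-ik\,\Delta x}\right) = 0,$$

or

$$r = 1 - \lambda + \lambda e^{-ik\,\Delta x} = \left[1 - \lambda + \lambda\cos(k\,\Delta x)\right] - i\lambda\sin(k\,\Delta x).$$

After some algebraic manipulation, we arrive at

$$|r|^2 = 1 - 2\lambda(1-\lambda)\left[1 - \cos(k\,\Delta x)\right] = 1 - 4\lambda(1-\lambda)\sin^2\frac{k\,\Delta x}{2}, \tag{7}$$

which is less than or equal to 1 provided $\lambda \le 1$. Similar analysis for the case when
$a_j^{(n)}$ is negative leads to the requirement $\lambda \ge -1$ (recall that $\lambda$ is negative when
$a_j^{(n)}$ is negative). Thus $|\lambda| \le 1$ is necessary and sufficient for stability. Recalling
the definition of $\lambda$, the restriction on the time step is

$$\Delta t \le \frac{\Delta x}{\max_j \left|a_j^{(n)}\right|}.$$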
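The stability bound can also be checked numerically. The sketch below evaluates the upwind amplification factor $r = 1 - \lambda + \lambda e^{-ik\,\Delta x}$ (for positive $a$) over a sweep of $\theta = k\,\Delta x$ and compares it with the closed form for $|r|^2$; the function names are illustrative and not part of the text.

```python
import numpy as np

# Sweep of theta = k * dx over one full period.
theta = np.linspace(0.0, 2.0 * np.pi, 721)

def upwind_amplification(lam, theta):
    """Amplification factor of the upwind scheme for a > 0."""
    return 1.0 - lam + lam * np.exp(-1j * theta)

def max_magnitude(lam):
    """Largest |r| over the sampled wavenumbers."""
    return np.abs(upwind_amplification(lam, theta)).max()

def magnitude_squared_formula(lam, theta):
    """Closed form: 1 - 4 lam (1 - lam) sin^2(theta / 2)."""
    return 1.0 - 4.0 * lam * (1.0 - lam) * np.sin(theta / 2.0) ** 2

for lam in (0.25, 0.5, 1.0, 1.1):
    print(lam, max_magnitude(lam))
```

For $\lambda$ slightly above 1 the largest magnitude occurs at $k\,\Delta x = \pi$, the shortest wave the grid can represent.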


If we carry our Fourier analysis a little further, we can begin to quantify the
errors introduced by the upwind scheme. Consider the function $f(x,t) = e^{i(kx + \omega t)}$.
This function is a solution of equation (3), with $a(x,t,u)$ constant and
$g(x,t,u) = 0$, provided $\omega = -ak$. In one time step, the amplitude of $f$ remains
constant, and the phase changes by

$$\omega\,\Delta t = -ak\,\Delta t = -\lambda k\,\Delta x.$$

In contrast, equation (7) indicates that, for all $\lambda \ne 1$, the discrete Fourier solution
to the finite difference equations experiences a reduction in amplitude on the order
of $(k\,\Delta x)^2$ in a single time step. This explains both the amplitude decay and the
rounding of the sharp corners (higher-frequency components experience a larger
reduction in amplitude) observed in Figure 11.1. To account for the discrepancy in
the speed of propagation, we examine the phase shift of the numerical mode, which
is given by

$$\arctan\frac{\operatorname{Im} r}{\operatorname{Re} r} = -\arctan\frac{\lambda\sin(k\,\Delta x)}{1 - \lambda + \lambda\cos(k\,\Delta x)} = -\lambda k\,\Delta x\left[1 - \frac{1}{6}(1-\lambda)(1-2\lambda)(k\,\Delta x)^2 + \cdots\right].$$

Note this is also in error by an amount on the order of $(k\,\Delta x)^2$ in a single time
step.

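To explain the spreading out of the solution, we will turn to an analysis tool
known as the modified equation. Essentially, this is the partial differential equation
which is actually being solved when the numerical method is applied. We will
continue to assume that $a(x,t,u)$ is constant, and, without loss of generality, will
further assume that $a$ is positive. The upwind difference equation is then

$$\frac{w_j^{(n+1)} - w_j^{(n)}}{\Delta t} + a\,\frac{w_j^{(n)} - w_{j-1}^{(n)}}{\Delta x} = g_j^{(n)}. \tag{8}$$

Substitute the true solution to the original partial differential equation, $u(x,t)$,
into equation (8), and expand the terms for $u_j^{(n+1)}$ and $u_{j-1}^{(n)}$ in Taylor series about
$(x_j, t_n)$:

$$u_j^{(n+1)} = u_j^{(n)} + \Delta t \left.\frac{\partial u}{\partial t}\right|_{(x_j,t_n)} + \frac{1}{2}(\Delta t)^2 \left.\frac{\partial^2 u}{\partial t^2}\right|_{(x_j,t_n)} + \cdots$$

$$u_{j-1}^{(n)} = u_j^{(n)} - \Delta x \left.\frac{\partial u}{\partial x}\right|_{(x_j,t_n)} + \frac{1}{2}(\Delta x)^2 \left.\frac{\partial^2 u}{\partial x^2}\right|_{(x_j,t_n)} + \cdots$$

This procedure, upon dropping the notation for evaluation at $(x_j, t_n)$, yields

$$\frac{\partial u}{\partial t} + a\frac{\partial u}{\partial x} + \frac{1}{2}\Delta t\frac{\partial^2 u}{\partial t^2} - \frac{1}{2}a\,\Delta x\frac{\partial^2 u}{\partial x^2} + \cdots = g.$$

The second derivative in time can be replaced by a second derivative in space as
follows:

$$\frac{\partial^2 u}{\partial t^2} = \frac{\partial}{\partial t}\left(\frac{\partial u}{\partial t}\right) = \frac{\partial}{\partial t}\left(-a\frac{\partial u}{\partial x}\right) = -a\frac{\partial}{\partial x}\left(\frac{\partial u}{\partial t}\right) = -a\frac{\partial}{\partial x}\left(-a\frac{\partial u}{\partial x}\right) = a^2\frac{\partial^2 u}{\partial x^2}.$$

Hence, the modified equation corresponding to the upwind scheme is

$$\frac{\partial u}{\partial t} + a\frac{\partial u}{\partial x} = g + \frac{1}{2}a\,\Delta x\,(1 - \lambda)\frac{\partial^2 u}{\partial x^2}.$$

The second derivative term introduces numerical (or false) diffusion and causes the
spreading of the solution in Figure 11.1.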
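To get a feel for the size of the false diffusion, the coefficient $\tfrac{1}{2}a\,\Delta x\,(1-\lambda)$ of the second-derivative term can be evaluated directly. The sketch below does so for constant positive $a$; the sample value $a = 2$ is an assumption chosen only for illustration.

```python
def numerical_diffusion(a, dx, dt):
    """Coefficient of u_xx in the upwind modified equation (constant a > 0)."""
    lam = a * dt / dx
    return 0.5 * a * dx * (1.0 - lam)

# Illustrative values: a = 2 on the coarse and the refined grid.
coarse = numerical_diffusion(2.0, 1 / 20, 1 / 200)
fine = numerical_diffusion(2.0, 1 / 40, 1 / 400)
print(coarse, fine)
```

Halving both step sizes leaves $\lambda$ unchanged and halves the coefficient, which is consistent with the reduced spreading seen at the finer resolution. The coefficient vanishes when $\lambda = 1$.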
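More General Formulations of the Advection Equation

What about the more general formulations of the advection equation given by
equations (1) and (2)? First, we will consider equation (1):

$$\frac{\partial u}{\partial t} + \frac{\partial (au)}{\partial x} = g.$$

Expanding the space derivative

$$\frac{\partial (au)}{\partial x} = a\frac{\partial u}{\partial x} + u\frac{\partial a}{\partial x},$$

we see that the sign of $a$ again determines the direction in which the characteristics
travel. Thus, if $a_j^{(n)} > 0$, we backward difference the space derivative and the
resulting finite difference equation becomes

$$w_j^{(n+1)} = w_j^{(n)} + \frac{\Delta t}{\Delta x}\left(a_{j-1}^{(n)} w_{j-1}^{(n)} - a_j^{(n)} w_j^{(n)}\right) + \Delta t\, g_j^{(n)}. \tag{9}$$

When $a_j^{(n)} < 0$, we forward difference the space derivative, leading to

$$w_j^{(n+1)} = w_j^{(n)} + \frac{\Delta t}{\Delta x}\left(a_j^{(n)} w_j^{(n)} - a_{j+1}^{(n)} w_{j+1}^{(n)}\right) + \Delta t\, g_j^{(n)}. \tag{10}$$

For advection equations of the form

$$\frac{\partial u}{\partial t} + \frac{\partial f(u)}{\partial x} = g$$

we proceed in a similar manner. Applying the chain rule to the space derivative,
we find

$$\frac{\partial f(u)}{\partial x} = \frac{\partial f}{\partial u}\frac{\partial u}{\partial x};$$

hence, the direction in which characteristics travel is determined by the sign of
$\partial f/\partial u$. The resulting upwind finite difference scheme is then given by

$$w_j^{(n+1)} = \begin{cases} w_j^{(n)} + \dfrac{\Delta t}{\Delta x}\left[f\!\left(w_{j-1}^{(n)}\right) - f\!\left(w_j^{(n)}\right)\right] + \Delta t\, g_j^{(n)}, & \partial f/\partial u > 0 \\[2ex] w_j^{(n)} + \dfrac{\Delta t}{\Delta x}\left[f\!\left(w_j^{(n)}\right) - f\!\left(w_{j+1}^{(n)}\right)\right] + \Delta t\, g_j^{(n)}, & \partial f/\partial u < 0. \end{cases} \tag{11}$$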

EXAMPLE 11.2 An Advection Equation of the Form (1) 


Consider the partial differential equation

$$\frac{\partial u}{\partial t} - \frac{\partial (xu)}{\partial x} = 0, \quad 0 < x < 1,\ t > 0$$

subject to the initial and boundary conditions

$$u(x,0) = \frac{x}{1+x} \quad\text{and}\quad u(1,t) = \frac{e^{2t}}{1+e^{t}}.$$

Note that $a(x,t,u) = -x$ and $g(x,t,u) = 0$. No boundary condition has been
specified along $x = 0$ because $x = 0$ happens to be a characteristic of the partial
differential equation. Consequently, we are not free to specify arbitrary values for
$u(0,t)$. Rather, applying the method of characteristics, we find that $u(0,t)$ must
satisfy the initial value problem

$$\frac{du}{dt} = u, \quad u(0,0) = 0.$$

Hence, the advection equation together with the specified initial condition
determine $u(0,t) = 0$ for all $t$.

Figure 11.2 displays the approximate solution to this problem at $t = 1$ and
$t = 2$ (the solid curves). The dotted curves correspond to the exact solution:

$$u(x,t) = \frac{x e^{2t}}{1 + x e^{t}}.$$

A mesh spacing of $\Delta x = 1/20$ was selected, and the time step was chosen as the
maximum allowed by stability considerations:

$$\Delta t = \Delta t_{\max} = \frac{\Delta x}{\max_j |a_j|} = \frac{\Delta x}{\max_{x\in[0,1]} |x|} = \frac{1/20}{1} = \frac{1}{20}.$$


EXAMPLE 11.3 An Advection Equation of the Form (2) 


Next, consider the partial differential equation

$$\frac{\partial u}{\partial t} + \frac{\partial (u^2/2)}{\partial x} = 0, \quad -\infty < x < \infty,\ t > 0$$

subject to the initial condition

$$u(x,0) = u_0(x) = \begin{cases} 1 - \cos(2\pi x), & 0 \le x \le 1 \\ 0, & \text{elsewhere.} \end{cases}$$

With $f(u) = u^2/2$, it follows that $f'(u) = u$. Figure 11.3 displays the approximate
solution at various times computed with $\Delta x = 1/100$ and $\Delta t = 1/200$. The time
step was selected based on a maximum propagation speed of 2, which is the largest
value taken on by the initial condition function, $u_0$. Note that the initial profile
tilts toward the right as time advances. Eventually (around $t = 0.2$), the right
side of the profile becomes vertical, and a jump discontinuity is produced in the
solution. This phenomenon is known as a shock and arises due to the crossing of
characteristics. Once the shock has formed, it propagates to the right with a slowly
decreasing amplitude.
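The calculation in this example can be sketched in a few lines. The code below applies the upwind scheme for the conservation form with $f(u) = u^2/2$; because the solution stays nonnegative, $\partial f/\partial u = u \ge 0$ and only the backward-differenced branch is needed. The truncated domain $[-1, 3]$ and the function names are assumptions made for this illustration.

```python
import numpy as np

def flux(u):
    """Flux function f(u) = u^2 / 2."""
    return 0.5 * u ** 2

def upwind_burgers(t_end, dx=1 / 100, dt=1 / 200, x_min=-1.0, x_max=3.0):
    """Upwind scheme for u_t + (u^2/2)_x = 0 with nonnegative data."""
    n_cells = round((x_max - x_min) / dx)
    x = x_min + dx * np.arange(n_cells + 1)
    inside = (x >= 0.0) & (x <= 1.0)
    w = np.where(inside, 1.0 - np.cos(2.0 * np.pi * x), 0.0)
    for _ in range(round(t_end / dt)):
        f = flux(w)
        # f'(u) = u >= 0 here, so the flux difference is taken backward.
        w[1:] = w[1:] + (dt / dx) * (f[:-1] - f[1:])
        w[0] = 0.0   # nothing enters through the left boundary
    return x, w

dx = 1 / 100
for t_end in (0.2, 0.5, 1.0):
    x, w = upwind_burgers(t_end)
    print(t_end, w.max(), np.sum(w) * dx)
```

The last printed column is the discrete integral of the solution, which the conservation form preserves as long as the pulse stays away from the boundaries of the truncated domain.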
Figure 11.2 Solution of an advection equation at times t = 1 and
t = 2. Approximate solutions, computed with the upwind difference 
method, are indicated by the solid curves, and the exact solutions are 
indicated by the dotted curves. 


Systems of Equations 


Consider the system of advection equations

$$\frac{\partial U}{\partial t} + \frac{\partial F(U)}{\partial x} = G,$$

where $U$ is a vector of dependent variables and $F(U)$ and $G$ are vector-valued
functions. Applying the chain rule to the space derivative yields

$$\frac{\partial U}{\partial t} + J(U)\frac{\partial U}{\partial x} = G,$$

where $J(U)$ is the Jacobian matrix for the function $F(U)$. In general, it is a
nontrivial task to use the upwind scheme on such a problem. For each time step
and at each point of the computational grid, the eigenvalues and eigenvectors of
the Jacobian matrix must be computed. The approximate solution must then
be expressed as a linear combination of the eigenvectors. The appropriate finite
difference formula, as determined by the sign of the corresponding eigenvalue, is
next applied to each eigenvector. Finally, the eigenvectors must be recombined to
form the solution at the next time step. It is only in the simplest cases, when the
Jacobian is diagonal, that upwind differencing is a viable alternative for a system
of equations. An example is presented below.
Figure 11.3 Approximate solution to the partial differential equation
$\partial u/\partial t + \partial (u^2/2)/\partial x = 0$
with one period of a vertically shifted cosine function as the initial condition.


Application Problem 1: Pollution Transported by Groundwater Flow 


Pollution seeps into the ground from a chemical plant and is then transported by
groundwater flow toward a river one kilometer away. Once in the ground, the
chemical pollution breaks down naturally, but very slowly. In the Overview to this
chapter (see page 883), we found that the normalized concentration of pollutant in
the ground, $c(x,t)$, satisfies the initial boundary value problem

$$\frac{\partial c}{\partial t} + v_g\frac{\partial c}{\partial x} = -\alpha c, \quad c(x,0) = 0, \quad c(0,t) = 1.$$

Here, $x$ measures distance from the seepage site toward the river, $t$ measures time
from the initial contamination, $v_g = 10$ meters/day is the groundwater velocity,
and $\alpha = 3.6 \times 10^{-3}$ day$^{-1}$ is the natural decay rate.

Note the right-hand side of the above advection equation contains the dependent
variable, so we must reexamine stability. Since $v_g = 10 > 0$, the appropriate
finite difference equation is

$$w_j^{(n+1)} - w_j^{(n)} + \lambda\left(w_j^{(n)} - w_{j-1}^{(n)}\right) = -\alpha\,\Delta t\, w_j^{(n)},$$
where $\lambda = v_g\,\Delta t/\Delta x$. The amplification factor for this finite difference equation is

$$r = \left[1 - \alpha\,\Delta t - \lambda + \lambda\cos(k\,\Delta x)\right] - i\lambda\sin(k\,\Delta x);$$

hence, a sufficient condition for the magnitude of $r$ to be less than or equal to one
is

$$\Delta t \le \frac{\Delta x}{\alpha\,\Delta x + v_g}.$$

Figure 11.4 displays the normalized pollution concentration profile after 50 days
and after 100 days. For this calculation $\Delta x = 10$ meters and $\Delta t = 50/51 \approx 0.980$
days, which is slightly below the maximum allowable time step

$$\Delta t_{\max} = \frac{10}{0.036 + 10} \approx 0.9964.$$

Figure 11.4 Normalized pollution concentration profile, as a function
of distance from seepage site, after 50 days and after 100 days.
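The groundwater calculation can be sketched as follows, using the parameter values quoted above. The grid length of 1000 meters (the distance to the river) and the helper name `simulate` are assumptions of this illustration, not details given in the text.

```python
import numpy as np

v_g = 10.0        # groundwater velocity, meters/day
alpha = 3.6e-3    # natural decay rate, 1/day
dx = 10.0         # meters
length = 1000.0   # meters to the river (assumed extent of the grid)

dt_max = dx / (alpha * dx + v_g)   # sufficient stability bound from the text
dt = 50.0 / 51.0                   # days, slightly below dt_max
lam = v_g * dt / dx

def simulate(days):
    """Upwind scheme with the decay term evaluated at the old time level."""
    x = np.arange(0.0, length + dx / 2, dx)
    w = np.zeros_like(x)
    w[0] = 1.0                      # c(0, t) = 1
    for _ in range(round(days / dt)):
        w[1:] = w[1:] - lam * (w[1:] - w[:-1]) - alpha * dt * w[1:]
        w[0] = 1.0
    return x, w

print(dt_max)
for days in (50, 100):
    x, w = simulate(days)
    print(days, w[10], w[-1])       # concentration at 100 m and at the river
```

Behind the advancing front the computed profile should settle close to $e^{-\alpha x/v_g}$, the steady decay along a characteristic, which gives a quick sanity check on the output.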


Application Problem 2: Air-Air Countercurrent Heat Exchanger


A countercurrent air-air heat exchanger in a home uses the hot exhaust air from
the furnace to heat cold air drawn from the outside. Within the exchanger, the
two air flows travel in opposite directions, hence the designation as countercurrent.
Consider the duct from the heat exchanger shown in the figure following. The duct
has length $L$, width $w$, and is separated horizontally into two chambers, each of
height $d$. Cold air flows from left to right through the upper chamber with uniform
velocity $v_C$, while hot air flows through the lower chamber with uniform velocity $v_H$.

[Figure: duct of the heat exchanger, with the cold air stream in the upper chamber
and the hot air stream in the lower chamber.]

Let $C(x,t)$ and $H(x,t)$ denote the temperature of the cold air and the hot air
streams, respectively. Orient the coordinate system so that $x = 0$ corresponds to
the cold air inlet. The heat flux (i.e., the rate at which thermal energy is transported
by the air flow) at any location within the cold air stream is $\rho c_p w d\, v_C C$, where $\rho$
is the mass density and $c_p$ the specific heat of the air. The heat flux within the hot
air is $-\rho c_p w d\, v_H H$. Assuming that diffusion effects are negligible and that heat
is transferred from the hot air to the cold air by convection with an overall heat
transfer coefficient $h$, a control volume analysis leads to the system of advection
equations

$$\frac{\partial C}{\partial t} + v_C\frac{\partial C}{\partial x} = \frac{h}{\rho c_p d}(H - C)$$

$$\frac{\partial H}{\partial t} - v_H\frac{\partial H}{\partial x} = \frac{h}{\rho c_p d}(C - H).$$


For simulation purposes, suppose $L = 3$ meters, $w = 0.4$ meters, and $d = 0.25$
meters. The initial temperatures are $C(x,0) = H(x,0) = 20$°C, and the boundary
conditions are $C(0,t) = 5$°C and $H(3,t) = 60$°C. For the remaining parameters,
let's take $v_C = v_H = 0.1$ meters/second, $h = 20$ W/m² · K, $\rho = 1$ kg/m³, and
$c_p = 1007$ J/kg · K. Though this is a system, it is clear that the $\partial C/\partial x$ term
should be backward differenced and the $\partial H/\partial x$ term should be forward differenced.
Figure 11.5 displays the approximate temperature profiles within the heat exchanger
in ten second increments. The solid curves represent the cold air stream, and the
dotted curves represent the hot air stream. All calculations were performed with
$\Delta x = 1/100$ meters and $\Delta t = 1/20$ seconds.

As in the case of the groundwater pollution problem earlier, the presence of
decay terms on the right-hand sides of the partial differential equations requires
that the stability condition be recalculated. Following standard procedures, we
find that the maximum allowable time step is given by

$$\Delta t_{\max} = \frac{\Delta x}{h\,\Delta x/(\rho c_p d) + \max(v_C, v_H)}.$$

For the specified parameter values, $\Delta t_{\max} \approx 0.0992$. A smaller value has been chosen
to reduce the truncation error.
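A minimal sketch of this calculation is given below, assuming the coupling terms are evaluated at the old time level and that the boundary values override the initial values at the two inlets. The variable and function names are illustrative only.

```python
import numpy as np

length, d = 3.0, 0.25            # duct length and chamber height, meters
v_C = v_H = 0.1                  # air speeds, meters/second
h, rho, c_p = 20.0, 1.0, 1007.0  # W/m^2 K, kg/m^3, J/kg K
dx, dt = 1 / 100, 1 / 20

beta = h / (rho * c_p * d)                    # exchange rate, 1/second
dt_max = dx / (beta * dx + max(v_C, v_H))     # stability bound from the text

def simulate(t_end):
    """Upwind differencing: backward for the cold stream, forward for the hot one."""
    n = round(length / dx)
    C = np.full(n + 1, 20.0)
    H = np.full(n + 1, 20.0)
    C[0], H[-1] = 5.0, 60.0
    lam_C, lam_H = v_C * dt / dx, v_H * dt / dx
    for _ in range(round(t_end / dt)):
        C_new, H_new = C.copy(), H.copy()
        C_new[1:] = C[1:] - lam_C * (C[1:] - C[:-1]) + beta * dt * (H[1:] - C[1:])
        H_new[:-1] = H[:-1] + lam_H * (H[1:] - H[:-1]) + beta * dt * (C[:-1] - H[:-1])
        C_new[0], H_new[-1] = 5.0, 60.0       # inlet temperatures
        C, H = C_new, H_new
    return C, H

print(dt_max)
C, H = simulate(10.0)
print(C[-1], H[0])   # outlet temperatures of the cold and hot streams
```

With these step sizes every update is a weighted average with nonnegative weights, so the computed temperatures stay between the two inlet values.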
Figure 11.5 Temperature profiles of the cold (solid curves) and hot
(dotted curves) air streams in an air-air countercurrent heat exchanger.

EXERCISES


In Exercises 1-4, apply the upwind scheme to the given advection equation. Truncate 
the domain and introduce numerical boundary conditions where appropriate. Compare 
the approximate solution to the indicated exact solution. 


Ou lta? Ou 


i ee eg §>0, 
dt * (+a)? + 2c Ox eae ae 


1, 05<2<1 

u(z,0) = uo(z) = { 0, elsewhere 

exact solution: u(z,t) = uo (« = ==) 

ms 2, O<a<il 
2. a tet i =-1, 2>0,t>0, u(x,0)=uple) =< 2-2, 1<a<2 

it Q, elsewhere 
2 ati oe 
eacveduton aes { ~ [e“*"(w@ +1 +8) — 1] -2, aoe ey 
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du A(a*u) 


Bp Tag =O O< a <4t>0, ufz,0)=0, uaa t? 


£6 2 
exact solution: u(z,t) = { ar(g-gtt), @>4/(4t+1) 
0, e<4/(4¢+1) 
ou 0 


2 
; OL Be la-2 Ju] = —22u, -l<ac lt > 0, u(x, 0) = 1 + cos(rz) 


-2t 
exact solution: u(z, t) = ug ec) loner 
l—-a+(l+z)e7 


5. Repeat the “Pollution Transported by Groundwater Flow” application problem
with $\Delta t = 1$ day. Observe the instability in the resulting solution.

6. Repeat the “Air-Air Countercurrent Heat Exchanger” application problem with
$\Delta t = 0.1$ seconds. Observe the instability in the resulting solution.

7. Show that the restriction

$$\Delta t \le \frac{\Delta x}{\alpha\,\Delta x + v_g}$$

is sufficient to guarantee that the magnitude of $r$ is less than or equal to one,
where

$$r = \left[1 - \alpha\,\Delta t - \lambda + \lambda\cos(k\,\Delta x)\right] - i\lambda\sin(k\,\Delta x)$$

and $\lambda = v_g\,\Delta t/\Delta x$.


8. Consider the advection equation

$$\frac{\partial u}{\partial t} + a\frac{\partial u}{\partial x} = g,$$

where both $a$ and $g$ are constant.

(a) Derive a finite difference method for this problem using a first-order forward
difference for the time derivative and a second-order central difference for
the space derivative.

(b) What restriction on $\Delta t$ is required for the method developed in part (a) to
satisfy the CFL condition?

(c) Use von Neumann stability analysis to show that the method developed in
part (a) is unconditionally unstable.


9. In a particular heat exchanger, steam is used to heat water as it flows through
a four meter long pipe, also known as a heat exchanger tube. The steam maintains
a constant temperature, $T_s = 100$°C, along the tube. The time-dependent
temperature distribution of the fluid, $T(x,t)$, satisfies

$$\frac{\partial T}{\partial t} + v\frac{\partial T}{\partial x} = \frac{2h}{\rho c_p r}(T_s - T),$$

where it has been assumed that there is perfect thermal coupling between the
fluid and the heat exchanger tube. Here, $v = 0.1$ m/s is the velocity of the water
in the tube, $h = 1000$ W/m² · °C is the heat transfer coefficient, $\rho = 989$ kg/m³
is the density of the water, $c_p = 4180$ J/kg · °C is the heat capacity of the water,
and $r = 30$ mm is the radius of the tube. Initially, the inlet temperature of the
water is $T(0,t) = 25$°C and the temperature distribution along the tube is

$$T(x,0) = 100 - 75\exp\left(-\frac{2hx}{\rho c_p r v}\right).$$

If the inlet water temperature is abruptly increased to $T(0,t) = 35$°C, approximate
$T(x,t)$ at 10 second intervals from $t = 0$ through $t = 100$.


10. A 2-meter-long air-air heat exchanger in a home uses the hot exhaust air from
the furnace to heat cold air drawn from the outside. Within the exchanger, the
two air flows travel in the same direction. Let $C(x,t)$ and $H(x,t)$ denote the
temperature of the incoming cold air and the exhausted hot air, respectively.
These functions satisfy the system of equations

$$\frac{\partial C}{\partial t} + v_C\frac{\partial C}{\partial x} = \frac{h}{\rho c_p d}(H - C)$$

$$\frac{\partial H}{\partial t} + v_H\frac{\partial H}{\partial x} = \frac{h}{\rho c_p d}(C - H)$$

subject to the initial condition $C(x,0) = H(x,0) = 20$°C and the boundary
conditions $C(0,t) = 5$°C and $H(0,t) = 60$°C. Here, $v_C = v_H = 0.1$ meters/second is
the velocity of the air through the heat exchanger, $h = 40$ W/m² · K is the overall
heat transfer coefficient between the two air streams, $\rho = 1$ kg/m³ is the density
and $c_p = 1007$ J/kg · K the heat capacity of the air, and $d = 0.25$ meters is the
height of each chamber of the heat exchanger. Approximate the temperature
profiles in each air stream at five second intervals from $t = 0$ through $t = 25$.


Exercises 11-13 deal with the “Donor Cell” method [2], which is a variation on the
upwind scheme. For the form of the advection equation given by (1), the corresponding
finite difference equation is

$$w_j^{(n+1)} = w_j^{(n)} + \frac{\Delta t}{\Delta x}\left(a_L w_L - a_R w_R\right) + \Delta t\, g_j^{(n)},$$

where

$$a_L = \frac{a_{j-1}^{(n)} + a_j^{(n)}}{2}, \qquad a_R = \frac{a_j^{(n)} + a_{j+1}^{(n)}}{2},$$

$$w_L = \begin{cases} w_{j-1}^{(n)}, & a_L > 0 \\ w_j^{(n)}, & a_L < 0 \end{cases}$$

and

$$w_R = \begin{cases} w_j^{(n)}, & a_R > 0 \\ w_{j+1}^{(n)}, & a_R < 0. \end{cases}$$

11. If $a$ is constant, show that the donor cell method reduces to the upwind scheme.

12. Apply the donor cell method to the initial value problem in Exercise 3 and
compare the approximate solution to that obtained using the upwind scheme.

13. Repeat Exercise 12 for the initial value problem in Exercise 4.
11.2 ADVECTION EQUATION, II: MACCORMACK METHOD


Though the upwind scheme has the benefit of being based on the analytical concept
of characteristics, the method is only first-order accurate and difficult to implement
for systems of equations. In this section, the MacCormack method for approximating
the solution of the advection equation will be presented. This is a second-order
method that is straightforward to extend from a single equation to systems of
equations.


MacCormack Method 


The MacCormack method is a predictor-corrector scheme for approximating the
solution of advection equations. Suppose the solution is known at time level $t = t_n$
and we want to advance to time level $t = t_{n+1}$. With the MacCormack method
we first replace both $\partial u/\partial t$ and $\partial u/\partial x$ by first-order forward difference formulas
to predict the value of $w_j^{(n+1)}$. Denote this predicted value by $w_j^*$. In the corrector
step, we again use a first-order forward difference formula for $\partial u/\partial t$. The space
derivative is discretized as the average of a forward difference approximation
computed using values at $t = t_n$ and a backward difference approximation computed
using the values obtained from the predictor. The right-hand side of the differential
equation is also handled as an average of values at the old and new time levels.
To be specific, consider the scalar advection equation

$$\frac{\partial u}{\partial t} + a\frac{\partial u}{\partial x} = g.$$

The predictor step takes the form

$$\frac{w_j^* - w_j^{(n)}}{\Delta t} + a_j^{(n)}\frac{w_{j+1}^{(n)} - w_j^{(n)}}{\Delta x} = g_j^{(n)}$$

or

$$w_j^* = (1 + \lambda)\,w_j^{(n)} - \lambda\,w_{j+1}^{(n)} + \Delta t\, g_j^{(n)}, \tag{1}$$

where $\lambda = a_j^{(n)}\Delta t/\Delta x$. The corrector step is then

$$\frac{w_j^{(n+1)} - w_j^{(n)}}{\Delta t} + \frac{1}{2}\left[a_j^{(n)}\frac{w_{j+1}^{(n)} - w_j^{(n)}}{\Delta x} + a_j^*\frac{w_j^* - w_{j-1}^*}{\Delta x}\right] = \frac{1}{2}\left[g_j^{(n)} + g_j^*\right]. \tag{2}$$

Here $a_j^*$ denotes the function $a(x,t,u)$ evaluated at $x = x_j$, $t = t_{n+1}$ and $u = w_j^*$; $g_j^*$
is similarly defined. If we solve the predictor equation for $w_{j+1}^{(n)}$, substitute the
result into equation (2), solve for $w_j^{(n+1)}$ and simplify, the corrector equation can
be written as

$$w_j^{(n+1)} = \frac{1}{2}\left[w_j^{(n)} + w_j^* - \lambda^*\left(w_j^* - w_{j-1}^*\right)\right] + \frac{1}{2}\Delta t\, g_j^*, \tag{3}$$

where $\lambda^* = a_j^*\Delta t/\Delta x$.




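For the more general advection equation

$$\frac{\partial u}{\partial t} + \frac{\partial (au)}{\partial x} = g,$$

the predictor step of the MacCormack method becomes

$$w_j^* = w_j^{(n)} - \frac{\Delta t}{\Delta x}\left(a_{j+1}^{(n)} w_{j+1}^{(n)} - a_j^{(n)} w_j^{(n)}\right) + \Delta t\, g_j^{(n)}. \tag{4}$$

The corresponding corrector equation can be written as

$$w_j^{(n+1)} = \frac{1}{2}\left[w_j^{(n)} + w_j^* - \frac{\Delta t}{\Delta x}\left(a_j^* w_j^* - a_{j-1}^* w_{j-1}^*\right)\right] + \frac{1}{2}\Delta t\, g_j^*. \tag{5}$$

To arrive at equation (5), we must perform the same manipulations that were
required to obtain equation (3). Finally, for an advection equation of the form

$$\frac{\partial u}{\partial t} + \frac{\partial f(u)}{\partial x} = g,$$

the predictor and corrector equations are

$$w_j^* = w_j^{(n)} - \frac{\Delta t}{\Delta x}\left(f_{j+1}^{(n)} - f_j^{(n)}\right) + \Delta t\, g_j^{(n)} \tag{6}$$

and

$$w_j^{(n+1)} = \frac{1}{2}\left[w_j^{(n)} + w_j^* - \frac{\Delta t}{\Delta x}\left(f_j^* - f_{j-1}^*\right)\right] + \frac{1}{2}\Delta t\, g_j^*, \tag{7}$$

respectively.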

Although we have used first-order finite difference formulas in both the predictor
and the corrector, the contributions to the truncation error cancel and produce a
scheme which is second-order in both $\Delta t$ and $\Delta x$. What about stability, amplitude
and phase errors and numerical diffusion? For our additional analysis of the
MacCormack method, we will assume that $a$ is constant [alternatively, that $f(u) = au$
for some constant $a$] and that $g$ is identically zero. With these assumptions, upon
eliminating the intermediate values $w_j^*$ from equations (1) and (3), we find

$$w_j^{(n+1)} = \frac{1}{2}\left\{ w_j^{(n)} + (1+\lambda)w_j^{(n)} - \lambda w_{j+1}^{(n)} - \lambda\left[(1+\lambda)w_j^{(n)} - \lambda w_{j+1}^{(n)} - (1+\lambda)w_{j-1}^{(n)} + \lambda w_j^{(n)}\right]\right\}$$

$$= \frac{1}{2}\lambda(1+\lambda)\,w_{j-1}^{(n)} + (1-\lambda^2)\,w_j^{(n)} + \frac{1}{2}\lambda(\lambda-1)\,w_{j+1}^{(n)}, \tag{8}$$

where $\lambda = a\,\Delta t/\Delta x$. We arrive at the same end result using equations (4) and (5)
or equations (6) and (7).
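The reduction to the three-point formula can be confirmed numerically. The sketch below implements one predictor-corrector step in the flux form with $g = 0$ and compares it, for $f(u) = au$, with the combined formula. Periodic boundary handling via `np.roll` is an assumption made purely to keep the example short; the function names are illustrative.

```python
import numpy as np

def maccormack_step(w, flux, dt_dx):
    """One MacCormack step in flux form with g = 0 on a periodic grid."""
    f = flux(w)
    w_star = w - dt_dx * (np.roll(f, -1) - f)               # predictor: forward difference
    f_star = flux(w_star)
    return 0.5 * (w + w_star - dt_dx * (f_star - np.roll(f_star, 1)))  # corrector: backward

def combined_step(w, lam):
    """Three-point formula obtained by eliminating the predicted values."""
    return (0.5 * lam * (1.0 + lam) * np.roll(w, 1)
            + (1.0 - lam ** 2) * w
            + 0.5 * lam * (lam - 1.0) * np.roll(w, -1))

a, dt_dx = 1.5, 0.4                       # sample values, so lambda = 0.6
rng = np.random.default_rng(0)
w = rng.standard_normal(64)
difference = maccormack_step(w, lambda u: a * u, dt_dx) - combined_step(w, a * dt_dx)
print(np.abs(difference).max())
```

Passing a different `flux`, such as `lambda u: 0.5 * u**2`, reuses the same step for a nonlinear conservation law.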
Since the characteristic through $(x_j, t_{n+1})$ intersects $t = t_n$ at $x = x_j - a\,\Delta t$
and equation (8) uses data from $x_{j-1}$, $x_j$ and $x_{j+1}$ to compute $w_j^{(n+1)}$, we must
have $|a\,\Delta t| \le \Delta x$, or $|\lambda| \le 1$, to satisfy the CFL condition. Next, we substitute
the discrete Fourier mode $w_j^{(n)} = r^n e^{ij(k\,\Delta x)}$ into (8). The resulting formula for the
amplification factor is

$$r = 1 - \lambda^2 + \lambda^2\cos(k\,\Delta x) - i\lambda\sin(k\,\Delta x) = 1 - 2\lambda^2\sin^2\frac{k\,\Delta x}{2} - i\lambda\sin(k\,\Delta x),$$

from which we calculate

$$|r|^2 = \left[1 - 2\lambda^2\sin^2\frac{k\,\Delta x}{2}\right]^2 + \lambda^2\sin^2(k\,\Delta x) = 1 - 4\lambda^2(1 - \lambda^2)\sin^4\frac{k\,\Delta x}{2}. \tag{9}$$

Equation (9), together with the CFL condition, indicates that $|\lambda| \le 1$ is necessary
and sufficient for stability. From (9), we also find that one time step of the
MacCormack method introduces an amplitude error on the order of $(k\,\Delta x)^4$, significantly
smaller than the amplitude error of the upwind scheme. One time step of the
MacCormack method also changes the phase of the Fourier mode by

$$\arctan\frac{\operatorname{Im} r}{\operatorname{Re} r} = -\arctan\frac{\lambda\sin(k\,\Delta x)}{1 - 2\lambda^2\sin^2(k\,\Delta x/2)} = -\lambda k\,\Delta x\left[1 - \frac{1}{6}(1 - \lambda^2)(k\,\Delta x)^2 + \cdots\right],$$


which is in error by $O\left((k\,\Delta x)^2\right)$. The modified equation associated with (6) is

$$\frac{\partial u}{\partial t} + a\frac{\partial u}{\partial x} = -\frac{1}{6}a\,(\Delta x)^2(1 - \lambda^2)\frac{\partial^3 u}{\partial x^3}.$$

Since this contains no second-derivative term, the method does not produce
numerical diffusion. On the other hand, the presence of the third derivative term
introduces what is known as numerical dispersion. This phenomenon manifests
itself as oscillations, or wiggles, before and after sharp wave fronts. The details of
the derivation of the modified equation are left as an exercise.
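The different orders of the amplitude errors can be seen by halving $k\,\Delta x$ and watching how fast $1 - |r|$ shrinks for each scheme. This is a small numerical sketch with illustrative function names; the value $\lambda = 0.5$ is an arbitrary choice.

```python
import numpy as np

def r_upwind(lam, theta):
    """Upwind amplification factor for a > 0, with theta = k * dx."""
    return 1.0 - lam + lam * np.exp(-1j * theta)

def r_maccormack(lam, theta):
    """MacCormack amplification factor for constant a and g = 0."""
    return 1.0 - 2.0 * lam ** 2 * np.sin(theta / 2.0) ** 2 - 1j * lam * np.sin(theta)

def amplitude_loss(r_func, lam, theta):
    """Amplitude lost by one Fourier mode in a single time step."""
    return 1.0 - abs(r_func(lam, theta))

lam = 0.5
ratio_upwind = amplitude_loss(r_upwind, lam, 0.1) / amplitude_loss(r_upwind, lam, 0.05)
ratio_maccormack = (amplitude_loss(r_maccormack, lam, 0.1)
                    / amplitude_loss(r_maccormack, lam, 0.05))
print(ratio_upwind, ratio_maccormack)
```

A ratio near 4 indicates a second-order amplitude error in $k\,\Delta x$, and a ratio near 16 indicates a fourth-order one.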


EXAMPLE 11.4 Two Sample Advection Equations 
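Consider the advection equation

$$\frac{\partial u}{\partial t} + 2t\frac{\partial u}{\partial x} = 0, \quad -\infty < x < \infty,\ t > 0$$

subject to the initial condition

$$u(x,0) = u_0(x) = \begin{cases} 1, & -1 < x < 0 \\ 0, & \text{elsewhere.} \end{cases}$$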
Figure 11.6 Performance of the MacCormack method on two
sample advection equations. In each graph, the approximate solutions are
plotted as solid curves and the exact solutions as dotted curves. 


Here $a(x,t,u) = 2t$ and $g(x,t,u) = 0$. As in Section 11.1, we will advance the
approximate solution to $t = 3$ and truncate the domain to the closed interval
$[-2, 11]$. Though the resulting mathematical problem requires a boundary condition
at $x = -2$ only, the MacCormack method needs a boundary condition at each end of
the truncated domain. Examination of the initial condition and the characteristics
indicates that the appropriate boundary conditions are $u(-2,t) = u(11,t) = 0$.

The top graph in Figure 11.6 displays the approximate solution at times $t = 1$,
$t = 2$ and $t = 3$ (the solid curves) computed using the MacCormack method with
$\Delta x = 1/20$ and $\Delta t = 1/200$. The exact solution of the partial differential equation,
$u(x,t) = u_0(x - t^2)$, is indicated by the dotted curves. With fourth order amplitude
errors and no numerical diffusion, the MacCormack method maintains both the
height and the width of the initial pulse much better than does the upwind scheme
(compare with Figure 11.1). However, numerical dispersion introduces oscillations
which trail behind each discontinuity in the solution. Thus, on this problem, the
performance of neither the upwind scheme nor the MacCormack method is entirely
satisfactory.

The performance of the MacCormack method on this second example problem
is, on the other hand, clearly superior to that of the upwind scheme. The bottom
graph in Figure 11.6 displays the approximate solution (solid curves) to the initial
boundary value problem

$$\frac{\partial u}{\partial t} - \frac{\partial (xu)}{\partial x} = 0, \quad 0 < x < 1,\ t > 0, \qquad u(x,0) = \frac{x}{1+x}, \quad u(1,t) = \frac{e^{2t}}{1+e^{t}},$$

computed using $\Delta x = 1/20$ and $\Delta t = 1/20$. At $x = 0$, the advection equation
together with the initial condition determine $u(0,t) = 0$ for all $t$. To the resolution
of the graphics device, the approximate solution is almost indistinguishable from
the exact solution,

$$u(x,t) = \frac{x e^{2t}}{1 + x e^{t}},$$

shown by the dotted curves. Compare this with the performance of the upwind
scheme in Figure 11.2.

EXAMPLE 11.5 Resolution of a Developing Shock 
In the previous section we found that the characteristics of the initial value problem

$$\frac{\partial u}{\partial t} + \frac{\partial (u^2/2)}{\partial x} = 0, \quad -\infty < x < \infty,\ t > 0$$

$$u(x,0) = u_0(x) = \begin{cases} 1 - \cos(2\pi x), & 0 \le x \le 1 \\ 0, & \text{elsewhere} \end{cases}$$

intersect, producing a shock and introducing a jump discontinuity into the solution.
Figure 11.7 displays the approximate solution to this initial value problem at
various times, computed with $\Delta x = 1/100$ and $\Delta t = 1/200$. This problem demonstrates
one of the main reasons that the MacCormack method has become popular
in practice: for nonlinear partial differential equations, the MacCormack method
provides high resolution of propagating fronts.
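A sketch of the MacCormack predictor-corrector pair applied to this initial value problem is shown below. Two choices here are assumptions of the illustration rather than details from the text: the domain is truncated to $[-1, 3]$ with zero boundary values, and a deliberately conservative time step $\Delta t = 1/400$ is used instead of the value quoted above, to leave a margin for the overshoots that numerical dispersion produces near the shock.

```python
import numpy as np

def flux(u):
    """Flux function f(u) = u^2 / 2."""
    return 0.5 * u ** 2

def maccormack_burgers(t_end, dx=1 / 100, dt=1 / 400, x_min=-1.0, x_max=3.0):
    """MacCormack method for u_t + (u^2/2)_x = 0 with zero boundary values."""
    n_cells = round((x_max - x_min) / dx)
    x = x_min + dx * np.arange(n_cells + 1)
    inside = (x >= 0.0) & (x <= 1.0)
    w = np.where(inside, 1.0 - np.cos(2.0 * np.pi * x), 0.0)
    r = dt / dx
    for _ in range(round(t_end / dt)):
        f = flux(w)
        w_star = w.copy()
        w_star[:-1] = w[:-1] - r * (f[1:] - f[:-1])      # predictor: forward difference
        w_star[-1] = 0.0
        f_star = flux(w_star)
        w_new = w.copy()
        w_new[1:] = 0.5 * (w[1:] + w_star[1:] - r * (f_star[1:] - f_star[:-1]))  # corrector
        w_new[0] = 0.0
        w_new[-1] = 0.0
        w = w_new
    return x, w

dx = 1 / 100
x, w = maccormack_burgers(0.5)
print(w.max(), w.min(), np.sum(w) * dx)
```

Comparing the printed extremes with the range $[0, 2]$ of the initial data gives a rough measure of the dispersive oscillations, while the discrete integral should stay essentially unchanged.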


Systems of Equations 


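Another major advantage of the MacCormack method is the simplicity with which
it can be extended to systems of equations. Essentially, little more than a change
in notation is needed. In particular, consider the system of advection equations

$$\frac{\partial U}{\partial t} + \frac{\partial F(U)}{\partial x} = G,$$

where $U$ is a vector of dependent variables and $F(U)$ and $G$ are vector-valued
functions. For conservation laws of this type, equations (6) and (7) become

$$W_j^* = W_j^{(n)} - \frac{\Delta t}{\Delta x}\left(F_{j+1}^{(n)} - F_j^{(n)}\right) + \Delta t\, G_j^{(n)}$$

and

$$W_j^{(n+1)} = \frac{1}{2}\left[W_j^{(n)} + W_j^* - \frac{\Delta t}{\Delta x}\left(F_j^* - F_{j-1}^*\right)\right] + \frac{1}{2}\Delta t\, G_j^*,$$

where $W_j^{(n)}$ denotes the vector of approximations to $U(x_j, t_n)$.

Figure 11.7 Resolution of a shock front in a nonlinear advection equation.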


Application Problem 1: Stuck Behind a Slow-Moving Truck—A Simplified 
Model 


In the Overview to this chapter (see page 885), we developed a model for the vehicle
density, $\rho(x,t)$, on a stretch of highway. That model consisted of the
convection-diffusion equation

$$\frac{\partial \rho}{\partial t} + \frac{\partial f(\rho)}{\partial x} = D\frac{\partial^2 \rho}{\partial x^2},$$

where $f(\rho) = v(\rho)\rho$, $v(\rho)$ is the density-dependent traffic velocity and $D$ is a
constant. Here, we will consider the case $D = 0$; that is, we will work with the
simplified model

$$\frac{\partial \rho}{\partial t} + \frac{\partial f(\rho)}{\partial x} = 0.$$

Specifically, suppose

$$\rho(x,0) = \begin{cases} 40, & x < -1 \\ 100 + 60\cos(\pi x), & -1 \le x \le 1 \\ 40, & x > 1, \end{cases}$$

which we use as an approximation to the density profile surrounding a slow moving
truck located at $x = 0$. At $t = 0$, the truck exits the roadway, and we want to
determine the evolution of $\rho$. For the function $f$, we take


p p \* 
1-0.7 — 0.3 ( ) | ; 
Pmax Pmax 


where $V_{\max} = 50$ mph is the speed limit on the road and $\rho_{\max} = 400$ cars/mile.
This particular function is based on the assumptions that the flow of cars goes
to zero at both extremes of car density, $f'(0) = V_{\max}$ and there is some density,
$0 < \rho_* < \rho_{\max}$, for which $f$ achieves a maximum (see Fowkes and Mahoney [1]).

Truncating the domain to the interval $[-2, 5]$ and choosing mesh spacings of
$\Delta x = 0.01$ miles and $\Delta t = \Delta x/V_{\max} = 2 \times 10^{-4}$ hours, we obtain the density
profiles displayed in Figure 11.8. Profiles are shown 0.6 minutes, 1.8 minutes, 3.0
minutes, and 4.2 minutes after the truck has exited the highway. Note the formation
of a shock and the numerical dispersion trailing behind the jump discontinuity.


f(p) = pVnax 


Application Problem 2: Free Surface Motion of Water in a Basin


A rectangular water basin, L meters long and W meters wide, opens on one end to 
a reservoir (see Figure 11.9). The water level in the reservoir changes over time, 
which causes the water level within the basin to change. By treating the basin 
dynamics as quasi-one dimensional flow and considering conservation of mass and 
conservation of momentum, Roberson and Crowe (Engineering Fluid Mechanics, 
3rd edition, Houghton Mifflin Company, Boston, 1985) develop the system of partial 
differential equations 


$$\frac{\partial h}{\partial t} + \frac{\partial (hV)}{\partial x} = 0$$

$$\frac{\partial (hV)}{\partial t} + \frac{\partial (hV^2 + gh^2/2)}{\partial x} = -\frac{c_f V|V|}{2}\frac{P}{W}$$

for the basin water surface profile, $h$, and the horizontal velocity of the water in the
basin, $V$. The additional parameters in these equations are the acceleration due to
gravity, $g$, the shear-stress coefficient, $c_f$, and the wetted perimeter $P = W + 2h$.
The initial conditions are given by $h(x,0) = h_0$ and $V(x,0) = 0$. At the open end
of the basin, $h(L,t)$ is known, while at the closed end of the basin $V(0,t) = 0$.

To apply the MacCormack method to this problem, we must first express the
system in vector form. Since the system contains time derivatives of $h$ and $hV$, we
let

$$U = \begin{bmatrix} h \\ hV \end{bmatrix} = \begin{bmatrix} u_1 \\ u_2 \end{bmatrix}.$$

Then

$$F(U) = \begin{bmatrix} hV \\ hV^2 + gh^2/2 \end{bmatrix} = \begin{bmatrix} u_2 \\ u_2^2/u_1 + g u_1^2/2 \end{bmatrix}$$

and

$$G = \begin{bmatrix} 0 \\ -c_f V|V|P/2W \end{bmatrix} = \begin{bmatrix} 0 \\ -c_f (hV/h)\,|(hV/h)|\,P/2W \end{bmatrix} = \begin{bmatrix} 0 \\ -c_f (u_2/u_1)\,|u_2/u_1|\,(W + 2u_1)/2W \end{bmatrix}.$$

The initial condition becomes

$$U(x,0) = \begin{bmatrix} h_0 \\ 0 \end{bmatrix}$$

and the boundary conditions are given by

$$U(L,t) = \begin{bmatrix} \text{known} \\ \ast \end{bmatrix} \quad\text{and}\quad U(0,t) = \begin{bmatrix} \ast \\ 0 \end{bmatrix},$$

where $\ast$ marks the component that the physical boundary condition leaves unspecified.

Note that at each end of the domain we need to specify an additional boundary
condition. The simplest scheme for doing so is to set the first difference of the
unknown quantity to zero; that is, after computing the values of the approximate
solution at all interior grid points, set

$$u_2(L,t) = u_2(L - \Delta x, t) \quad\text{and}\quad u_1(0,t) = u_1(\Delta x, t).$$

Figure 11.8 Vehicle density profiles after a slow moving truck exits
the highway.

Figure 11.9 Rectangular water basin open at one end to a reservoir.

Suppose the basin is 60 meters long and 15 meters wide. The shear-stress
coefficient is $c_f = 0.01$, the acceleration due to gravity is $g = 9.81$ and the initial,
uniform depth of the water in the basin is $h_0 = 5$ meters. Take

$$h(L,t) = 5 + 0.5\sin(0.1t) \text{ meters}$$

as the varying depth at the open end of the basin and $\Delta x = 0.6$ meters. The
maximum allowable time step for this problem is

$$\Delta t_{\max} = \frac{\Delta x}{|V_{\max}| + \sqrt{g h_{\max}}},$$

where $|V_{\max}| + \sqrt{g h_{\max}}$ is the spectral radius of the Jacobian associated with $F(U)$.
Conservatively estimating $|V_{\max}| = 1$ and $h_{\max} = 10$ yields

$$\Delta t_{\max} = \frac{0.6}{1 + \sqrt{98.1}} \approx 0.055.$$

Therefore, let's use $\Delta t = 1/20$ seconds. Simulation results are displayed in
Figure 11.10. The top graph shows the free surface profiles after 20, 40, and 60 seconds.
The bottom graph indicates the depth of the water, as a function of time, at the
open and closed ends of the basin.
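The ingredients of this calculation, namely the flux vector, the source vector and the time step bound, can be written down compactly. The following is a sketch using the parameter values stated above; the function names are illustrative, and the full time-stepping loop with its boundary treatment is not reproduced here.

```python
import math

G_ACCEL = 9.81   # acceleration due to gravity
C_F = 0.01       # shear-stress coefficient
WIDTH = 15.0     # basin width W, meters

def flux(u1, u2):
    """F(U) for U = (h, hV): (u2, u2^2/u1 + g u1^2/2)."""
    return (u2, u2 ** 2 / u1 + 0.5 * G_ACCEL * u1 ** 2)

def source(u1, u2):
    """G(U): friction term with wetted perimeter P = W + 2h."""
    v = u2 / u1
    wetted = WIDTH + 2.0 * u1
    return (0.0, -C_F * v * abs(v) * wetted / (2.0 * WIDTH))

def max_time_step(dx, v_max, h_max):
    """Bound dx / (|V_max| + sqrt(g h_max)) based on the spectral radius."""
    return dx / (abs(v_max) + math.sqrt(G_ACCEL * h_max))

dt_max = max_time_step(0.6, 1.0, 10.0)
print(dt_max, flux(5.0, 0.0), source(5.0, 0.0))
```

With the conservative estimates from the text, the chosen step of 1/20 second sits just under the computed bound.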
Figure 11.10 Motion of free surface of water in a basin which is open
at one end to a reservoir. (Top graph) Free surface profiles at various 
times. (Bottom graph) Water depth as a function of time at the open 
and closed ends of the basin. 
EXERCISES
1. Derive equations (5) and (7). 
2. Suppose that a is constant. Show that equations (1) and (3) are identical to 
equations (4) and (5), which are, in turn, identical to equations (6) and (7). 


For Exercises 3-6, apply the MacCormack method to the given advection equation. 
Truncate the domain and introduce numerical boundary conditions where appropriate. 
Compare the approximate solution to the indicated exact solution. 


Ou 1+2? Ou 
— - t>0 = = 
dt * (1+ 22)? + ta: Ax 0 ees A 
1, 085<a2<1 ae =, t 
{ Ox celesaneee exact solution: u(a,t) = uo (« Sis -;) 
au Ou z, 0O<a2<1 
4, ap ets hy z>0,t>0, u(z,0)=uole)=< 2-2, l<a<2 
t £ 0, elsewhere 
“83 _ J uo le*@+14+t)-1]-t, 2>eb-1-2 
exact solution: u(«, ¢) = { 0, a<e'—1-t 
du A(x? u)
ai “Bs =0, 0<2<4,t>0, ulz,0)=0, u(4,t)=2? 
; (4-242)? A/(4t-+1 
exact solution: u(z,t)= 4 2? \d 2 yo, @>4/(4t+1) 
0, x <4/(4t41) 


wy ofa | = 2 1 
Be Be a*\n] = -20u, -l<a2<l,t>0, u(z,0) = 1+ cos(xz) 


exact solution: +z, t) = ug ( 


(l+aje"%* — (1-2) 
l-2+(14+z)e—% 


7. Numerically verify that the MacCormack method is second-order in $\Delta x$ and $\Delta t$
by approximating the solution of the initial boundary value problem


du —- (au) 5 x et 
at Ba = 0, QO<a<1t>0, te Oy BO eee 


The exact solution for this problem is 


gett 


UN) ence 


8. Repeat Exercise 7 using the initial value problem in Exercise 6. 


9. Numerically verify that the MacCormack method is better than first order, but
not second order, in $\Delta x$ and $\Delta t$ for the initial boundary value problem in
Exercise 5. Why is the method not second order for this problem?

10. Reconsider the “Stuck Behind a Slow-Moving Truck—A Simplified Model”
application problem.

(a) Estimate the amount of time needed for the peak of the car density profile
to drop to 100 cars/mile.

(b) If the initial density profile is taken to be


70, z<o
p(z,0) = ¢ 100- ren cos(wr), AO<2<2
40, x > 2,


calculate the density profile after 0.6, 1.8, 3.0, and 4.2 minutes. Estimate the
amount of time needed for the peak of the profile to drop to 110 cars/mile.

11. For the function $F(U)$ as defined in the “Free Surface Motion of Water in a
Basin” application problem:
(a) Compute the Jacobian matrix $J(U)$.
(b) Determine the eigenvalues of $J(U)$.
(c) Show that $\rho(J) = |V_{\max}| + \sqrt{g h_{\max}}$.

12. Repeat the “Free Surface Motion of Water in a Basin” application problem for a
basin that is 50 meters long and 20 meters wide. Take $h(L,t) = 5 + 0.75\sin(0.06t)$
and leave all other parameter values unchanged.
(a) Determine the free surface profiles after 25, 50, 75, and 100 seconds.
(b) Determine the water depth as a function of time at both the open and the
closed end of the basin for $t < 180$ seconds.




13. If the bottom of the basin is sloped, then the equations for the free surface motion 
become 


dh a(hV) 


7 ORV) , (AV? + gh? /2) _ egV|V| P 
Me ee ee Be alae ae 


where S is the slope of the bottom surface. Repeat the previous exercise with 
S = 0.04, assuming that the initial depth of the water is 3 meters at the closed 
end of the basin and 5 meters at the open end. 

14. In the Overview to this chapter (see page 885), we showed that the voltage,
$V(x,t)$, and the current, $I(x,t)$, along a transmission line satisfy the system of
advection equations

$$l\frac{\partial I}{\partial t} + \frac{\partial V}{\partial x} + rI = 0$$

$$c\frac{\partial V}{\partial t} + \frac{\partial I}{\partial x} + gV = 0,$$


where $l$, $r$, $c$, and $g$ are the inductance, resistance, capacitance, and current
leakage per unit length of the transmission line, respectively. Suppose the line is 100
meters long with $l = 1.2$ henries/meter, $c = 0.3$ farads/meter, g = 0.0038 sole
and $r = 0.1$ ohms/meter. The initial conditions are


I(x,0) = 5.5sin 7 and V(z, 0) = 100sin =, 


while the boundary conditions are 
$$V(0,t) = V(100,t) = 0 \quad\text{and}\quad I(0,t) = I(100,t) = 0.$$

Determine the voltage along the line after 30, 60, 90, and 120 seconds. 


15. Show that the modified equation associated with the finite difference equation 


wns * (1+ A}u! wi, + (1 Aaa + 20a — ah”) 


3 g41 
is 
Ou du a, Pu 
ep Coe = ~ a(n)? (1- a3 5 
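Exercises 16-18 deal with the Lax-Wendroff method, which for the advection equation

$$\frac{\partial u}{\partial t} + \frac{\partial f(u)}{\partial x} = 0$$

is given by

$$w_j^{(n+1)} = w_j^{(n)} - \frac{\lambda}{2}\left(f_{j+1}^{(n)} - f_{j-1}^{(n)}\right) + \frac{\lambda^2}{2}\left[a_{j+1/2}\left(f_{j+1}^{(n)} - f_j^{(n)}\right) - a_{j-1/2}\left(f_j^{(n)} - f_{j-1}^{(n)}\right)\right],$$

where

$$\lambda = \frac{\Delta t}{\Delta x} \quad\text{and}\quad a_{j+1/2} = f'\!\left(\tfrac{1}{2}\left(w_{j+1}^{(n)} + w_j^{(n)}\right)\right).$$
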
16. Suppose f(u) = au for some constant a. 
(a) Perform a von Neumann stability analysis on the Lax-Wendroff method. 
(b) What are the amplitude and phase errors associated with the Lax-Wendroff
method? 

(c) What is the modified equation associated with the Lax-Wendroff method? 
Do you expect the method to produce numerical diffusion and/or numerical 
dispersion? Explain. 


17. Apply the Lax-Wendroff method to the initial value problem 


$$\frac{\partial u}{\partial t} + \frac{\partial (u^2/2)}{\partial x} = 0, \qquad u(x,0) = u_0(x) = \begin{cases} 1 - \cos(2\pi x), & 0 \le x \le 1 \\ 0, & \text{elsewhere.} \end{cases}$$

Truncate the domain to $[-0.5, 1.5]$ and use $\Delta x = 1/100$ and $\Delta t = 1/200$. Compare
the solutions at $t = 0.10$, $t = 0.15$, $t = 0.20$, $t = 0.25$, $t = 0.30$, and $t = 0.35$
to those obtained from the MacCormack method (see Figure 11.7).

18. Apply the Lax-Wendroff method to the initial value problem in the “Stuck Behind 
a Slow-Moving Truck—A Simplified Model” application problem. How do the 
solutions compare to those obtained by the MacCormack method? 


11.3 CONVECTION-DIFFUSION EQUATION


For the physical phenomena that can be modeled by the advection equation, such as 
contaminant transport, traffic flow, and heat flow, it is not uncommon for diffusion 
effects to be as important as convective (transport) effects. The objective of this 
section is to extend both the upwind differencing scheme and the MacCormack 
method to handle such convection-diffusion problems. 


Upwind Scheme 
Consider the partial differential equation

$$\frac{\partial c}{\partial t} + \frac{\partial (uc)}{\partial x} = D\frac{\partial^2 c}{\partial x^2}, \tag{1}$$

which includes both a convection term, $\partial(uc)/\partial x$, and a diffusion term, $D\,\partial^2 c/\partial x^2$.
Here $u$ is the velocity of the underlying flow, which need not be constant, but is assumed
to be independent of $t$. For the upwind scheme, we apply a first-order forward
difference approximation for the time derivative and a second-order central difference
approximation for the diffusion term. For the convection term, we apply either
a first-order forward difference or a first-order backward difference approximation,
depending on the sign of $u$.

When $u > 0$, $c$ is being transported from left to right by the flow, so a
backward difference is the appropriate choice for discretizing the convection term.
The resulting finite difference equation is

$$\frac{c_j^{(n+1)} - c_j^{(n)}}{\Delta t} + \frac{(uc)_j^{(n)} - (uc)_{j-1}^{(n)}}{\Delta x} = D\,\frac{c_{j-1}^{(n)} - 2c_j^{(n)} + c_{j+1}^{(n)}}{(\Delta x)^2}$$

or

$$c_j^{(n+1)} = c_j^{(n)} + \frac{\Delta t}{\Delta x}\left[(uc)_{j-1}^{(n)} - (uc)_j^{(n)}\right] + \mu\left(c_{j-1}^{(n)} - 2c_j^{(n)} + c_{j+1}^{(n)}\right), \tag{2}$$

where $\mu = D\Delta t/(\Delta x)^2$. When $u < 0$, a forward difference is the appropriate choice
for discretizing the convection term, leading to the finite difference equation

$$c_j^{(n+1)} = c_j^{(n)} + \frac{\Delta t}{\Delta x}\left[(uc)_j^{(n)} - (uc)_{j+1}^{(n)}\right] + \mu\left(c_{j-1}^{(n)} - 2c_j^{(n)} + c_{j+1}^{(n)}\right). \tag{3}$$

Equations (2) and (3), together, comprise the upwind differencing scheme for the
convection-diffusion problem given by (1).
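To assess the stability of the upwind differencing scheme, we will assume that
$u$ is constant and first consider the case when $u > 0$. With the assumption of
constant flow velocity, equation (2) becomes

$$c_j^{(n+1)} = c_j^{(n)} + \lambda\left(c_{j-1}^{(n)} - c_j^{(n)}\right) + \mu\left(c_{j-1}^{(n)} - 2c_j^{(n)} + c_{j+1}^{(n)}\right), \tag{4}$$

where $\lambda = u\Delta t/\Delta x$. Substituting the discrete Fourier mode $c_j^{(n)} = r^n e^{i(j\theta)}$ into
equation (4) leads to

$$r = 1 + \lambda\left(e^{-i\theta} - 1\right) + 2\mu(\cos\theta - 1) = 1 - \lambda - 2\mu + (\lambda + 2\mu)\cos\theta - i\lambda\sin\theta.$$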


Note that as $\theta$ ranges from $-\pi$ to $\pi$, the amplification factor traces out an ellipse in
the complex plane. The center of this ellipse is located at the point $(1 - \lambda - 2\mu, 0)$,
the horizontal semiaxis of the ellipse is of length $\lambda + 2\mu$ and the vertical semiaxis
is of length $\lambda$. For stability, the ellipse must lie entirely inside the unit disk. This
requires $\lambda \le 1$ so that the ellipse will not go beyond the top or bottom of the disk
and $-1 \le 1 - 2\lambda - 4\mu$, or $\lambda + 2\mu \le 1$, so the ellipse will not pass the left side of
the disk. The latter inequality poses the more restrictive condition, hence we need
$\lambda + 2\mu \le 1$ for stability. If we follow precisely the same steps for the case when
$u < 0$ and combine our results with the $u > 0$ case, we obtain the final stability
requirement:

$$|\lambda| + 2\mu \le 1,$$

or, in terms of the time step,

$$\Delta t \le \left(\frac{|u|}{\Delta x} + \frac{2D}{(\Delta x)^2}\right)^{-1}.$$
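
The ellipse argument can be spot-checked numerically. The sketch below samples the amplification factor over $-\pi \le \theta \le \pi$ and reports its largest magnitude; the function names and the sampling density are illustrative assumptions, not part of the text.

```python
import cmath
import math

def upwind_amplification(lam, mu, theta):
    """r = 1 + lam*(exp(-i*theta) - 1) + 2*mu*(cos(theta) - 1) for constant u > 0."""
    return 1 + lam * (cmath.exp(-1j * theta) - 1) + 2 * mu * (math.cos(theta) - 1)

def max_amplification(lam, mu, samples=721):
    """Largest |r| over equally spaced theta in [-pi, pi] (endpoints included)."""
    thetas = [-math.pi + 2 * math.pi * k / (samples - 1) for k in range(samples)]
    return max(abs(upwind_amplification(lam, mu, th)) for th in thetas)

# lam + 2*mu = 1 sits on the stability boundary; lam + 2*mu = 1.2 violates it.
print(max_amplification(0.5, 0.25))   # does not exceed 1, up to roundoff
print(max_amplification(0.6, 0.30))   # exceeds 1, so some mode grows
```

Sampling cannot prove stability, but it is a quick way to catch an algebra slip in a derived amplification factor.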


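The modified equation associated with (4) is given by

$$\frac{\partial c}{\partial t} + u\frac{\partial c}{\partial x} = \left[D + \frac{1}{2}u\Delta x(1 - \lambda)\right]\frac{\partial^2 c}{\partial x^2} + \cdots$$

(see Exercise 1). In other words, the upwind scheme produces an effective diffusion
coefficient of

$$D + \frac{1}{2}u\Delta x(1 - \lambda).$$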
Since we need $\lambda + 2\mu \le 1$ for stability, it follows that $1 - \lambda \ge 2\mu$. The increase in
the diffusion coefficient is then at least

$$\frac{1}{2}u\Delta x \cdot 2\mu = \mu u\Delta x = \frac{D\Delta t}{(\Delta x)^2}\,u\Delta x = D\,\frac{u\Delta t}{\Delta x} = D\lambda.$$

Thus, $\lambda$ provides a lower bound on the percentage increase in the diffusion
coefficient.


MacCormack Method 


The MacCormack method can also be easily extended from the advection equation
to the convection-diffusion equation. Working from equation (1), the predictor step
uses a first-order forward difference approximation for the convection term and a
second-order central difference approximation for the diffusion term. This yields

$$\frac{c_j^{*} - c_j^{(n)}}{\Delta t} + \frac{(uc)_{j+1}^{(n)} - (uc)_j^{(n)}}{\Delta x} = D\,\frac{c_{j-1}^{(n)} - 2c_j^{(n)} + c_{j+1}^{(n)}}{(\Delta x)^2}$$

or

$$c_j^{*} = c_j^{(n)} + \frac{\Delta t}{\Delta x}\left[(uc)_j^{(n)} - (uc)_{j+1}^{(n)}\right] + \mu\left(c_{j-1}^{(n)} - 2c_j^{(n)} + c_{j+1}^{(n)}\right). \tag{5}$$

$c_j^{*}$ is used to denote the predicted value for $c_j^{(n+1)}$ and, again, $\mu = D\Delta t/(\Delta x)^2$.
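For the corrector step, the convection term is discretized as the average of a
forward difference approximation computed using values at $t = t_n$ and a backward
difference approximation computed using the values obtained from the predictor.
The diffusion term is handled in exactly the same manner, thus producing

$$\frac{c_j^{(n+1)} - c_j^{(n)}}{\Delta t} + \frac{1}{2}\left[\frac{(uc)_{j+1}^{(n)} - (uc)_j^{(n)}}{\Delta x} + \frac{(uc)_j^{*} - (uc)_{j-1}^{*}}{\Delta x}\right] = \frac{D}{2}\left[\frac{c_{j-1}^{(n)} - 2c_j^{(n)} + c_{j+1}^{(n)}}{(\Delta x)^2} + \frac{c_{j-1}^{*} - 2c_j^{*} + c_{j+1}^{*}}{(\Delta x)^2}\right].$$

Eliminating the term

$$\frac{(uc)_{j+1}^{(n)} - (uc)_j^{(n)}}{\Delta x}$$

by making use of the predictor equation, the corrector equation can be expressed
as

$$c_j^{(n+1)} = \frac{1}{2}\left\{c_j^{(n)} + c_j^{*} - \frac{\Delta t}{\Delta x}\left[(uc)_j^{*} - (uc)_{j-1}^{*}\right] + \mu\left(c_{j-1}^{*} - 2c_j^{*} + c_{j+1}^{*}\right)\right\}. \tag{6}$$
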
Hoffman [1] has determined that this method is stable provided $|\lambda| \le 0.9$ and
$\mu \le 0.5$, where $\lambda = u\Delta t/\Delta x$. As with the advection equation, the MacCormack
method does not introduce numerical diffusion into the approximate solution for
the convection-diffusion equation, but may still introduce numerical dispersion (i.e.,
oscillations in the vicinity of a propagating front).
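
The predictor and corrector formulas translate almost line for line into code. The following is a minimal sketch, not a definitive implementation: the name `maccormack_step`, the use of plain Python lists, and the decision to hold the two end values fixed (Dirichlet data supplied by the caller) are all assumptions made for illustration, as are the parameter values in the usage lines.

```python
def maccormack_step(c, u, D, dx, dt):
    """One MacCormack step for c_t + (u c)_x = D c_xx on a uniform grid.

    c and u are lists of nodal values; the first and last entries of c are
    treated as fixed boundary values (an assumption of this sketch).
    """
    n = len(c)
    mu = D * dt / dx**2
    r = dt / dx
    flux = [u[j] * c[j] for j in range(n)]
    # Predictor: forward difference on the convection term.
    star = list(c)
    for j in range(1, n - 1):
        star[j] = (c[j] + r * (flux[j] - flux[j + 1])
                   + mu * (c[j - 1] - 2 * c[j] + c[j + 1]))
    flux_star = [u[j] * star[j] for j in range(n)]
    # Corrector: backward difference on the predicted values, then average.
    new = list(c)
    for j in range(1, n - 1):
        new[j] = 0.5 * (c[j] + star[j]
                        - r * (flux_star[j] - flux_star[j - 1])
                        + mu * (star[j - 1] - 2 * star[j] + star[j + 1]))
    return new

# Usage: 41 nodes on [0, 2], c = 1 held at x = 0, c = 0 elsewhere initially.
dx, dt, D = 1 / 20, 1 / 20, 1 / 40      # lambda = 0.1, mu = 0.5
c = [1.0] + [0.0] * 40
u = [0.1] * 41
for _ in range(80):                      # 80 steps of 1/20 reach t = 4
    c = maccormack_step(c, u, D, dx, dt)
```

Storing the predicted values in a separate list keeps the corrector from reading partially updated data, which is the most common mistake when coding two-stage schemes in place.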
A Worked Example


EXAMPLE 11.6 A Sample Problem
Consider the convection-diffusion problem

$$\frac{\partial c}{\partial t} + 0.1\frac{\partial c}{\partial x} = D\frac{\partial^2 c}{\partial x^2}, \qquad c(0,t) = 1, \quad c(2,t) = 0, \quad c(x,0) = 0.$$

For all calculations, we will use $\Delta x = 1/20$. Let's start with $D = 0.025 = 1/40$.
For the upwind scheme, we will take the maximum allowed time step:

$$\Delta t_{\text{up}} = \left(\frac{1/10}{1/20} + \frac{2(1/40)}{1/400}\right)^{-1} = \frac{1}{22}.$$

With this value for $\Delta t$, it follows that

$$\lambda = \frac{(1/10)(1/22)}{1/20} = \frac{1}{11},$$

so there will be less than a 10% increase in the diffusion coefficient. The time step
for the MacCormack method must satisfy

$$\Delta t_{\text{Mac}} \le \frac{1/400}{2(1/40)} = \frac{1}{20} \quad\text{and}\quad \Delta t_{\text{Mac}} \le \frac{(9/10)(1/20)}{1/10} = \frac{9}{20};$$

we will therefore choose $\Delta t_{\text{Mac}} = 1/20$. The top graph in Figure 11.11 displays the
approximate solutions at $t = 4$. Note there is only a slight increase in the diffusion
of the front as calculated by the upwind scheme.
Next, suppose $D = 0.0025 = 1/400$. The maximum allowed time step for the
upwind scheme is then

$$\Delta t_{\text{up}} = \left(\frac{1/10}{1/20} + \frac{2(1/400)}{1/400}\right)^{-1} = \frac{1}{4},$$

for which

$$\lambda = \frac{(1/10)(1/4)}{1/20} = \frac{1}{2}.$$

In this case, there will thus be a 50% increase in the effective diffusion coefficient.
For the MacCormack method, we must have

$$\Delta t_{\text{Mac}} \le \frac{1/400}{2(1/400)} = \frac{1}{2} \quad\text{and}\quad \Delta t_{\text{Mac}} \le \frac{(9/10)(1/20)}{1/10} = \frac{9}{20}.$$

Since we still want to advance the solution to $t = 4$ in an integer number of time
steps, we will use $\Delta t_{\text{Mac}} = 4/9$ (i.e., we will take 9 time steps). The approximate
solutions with $D = 0.0025 = 1/400$ are shown in the bottom graph of Figure 11.11.
As expected, the upwind scheme produces substantially more diffusion of the front
in this case.
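
The time-step arithmetic in this example can be reproduced exactly with rational numbers. The sketch below uses Python's `fractions` module; the helper names `upwind_dt` and `maccormack_dt_bounds` are hypothetical and simply encode the bounds $|\lambda| + 2\mu \le 1$, $\mu \le 0.5$ and $|\lambda| \le 0.9$ quoted in the text.

```python
from fractions import Fraction as F

def upwind_dt(u, D, dx):
    """Largest dt with |lambda| + 2*mu <= 1."""
    return 1 / (abs(u) / dx + 2 * D / dx**2)

def maccormack_dt_bounds(u, D, dx):
    """(diffusion bound from mu <= 1/2, convection bound from |lambda| <= 9/10)."""
    return (dx**2 / (2 * D), F(9, 10) * dx / abs(u))

u, dx = F(1, 10), F(1, 20)
for D in (F(1, 40), F(1, 400)):
    dt = upwind_dt(u, D, dx)
    lam = u * dt / dx                      # relative increase in diffusion
    print(f"D = {D}: dt_up = {dt}, lambda = {lam}, "
          f"MacCormack bounds = {maccormack_dt_bounds(u, D, dx)}")
```

For $D = 1/40$ the diffusion bound is the binding one for the MacCormack method, while for $D = 1/400$ the convection bound takes over, which matches the choices made in the example.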
Figure 11.11 Approximate solution to a sample convection-diffusion
problem computed with the upwind scheme and the MacCormack 
method for different values of the diffusion coefficient. 


Application Problem 1: Passive Contaminant Transport 


A smoke stack releases a passive contaminant (i.e., a chemical that does not react
with air) into the air. Once released, the contaminant is transported to the
surrounding area by prevailing wind currents and by diffusion. Suppose we consider
only one-dimensional transport, and let $c(x,t)$ denote the normalized contaminant
level as a function of distance from the smoke stack, $x$, and time, $t$. Performing
a control volume analysis similar to the one used for the groundwater pollution
problem in the Chapter 11 Overview, we find that $c(x,t)$ satisfies the
convection-diffusion equation

$$\frac{\partial c}{\partial t} + \frac{\partial (uc)}{\partial x} = D\frac{\partial^2 c}{\partial x^2},$$

where $u$ is the wind velocity and $D$ is the coefficient of diffusion.

Let's take $D = 1.3 \times 10^{-4}$ miles²/hour and $u = 2.5\left[1 + e^{-x/5}\cos(1.7\pi x/5)\right]$
miles/hour. Note that the velocity attains a maximum of 5 miles/hour at the smoke
stack and decreases to a minimum of roughly 1.09 miles/hour at $x = 2.77$ miles.
We will assume that the air is initially free from contamination ($c(x,0) = 0$) and
that the contaminant level at the smoke stack is constant ($c(0,t) = 1$).

Figure 11.12 Contamination profiles for passive contaminant released
into the air from a smoke stack.

Figure 11.12 displays contaminant profiles at $t = 30$ minutes, $t = 1$ hour
and $t = 2$ hours computed using both the upwind scheme (top graph) and the
MacCormack method (bottom graph). $\Delta x$ was selected as 0.01 miles. The time
steps, in hours, were taken as


eat 5 2.6 x 10-4 
we = oq <\o017 10x 10-4 


for the upwind scheme, and 


1 _ (0.01)(0.9) 


Mac — tap = At 
Atm 560 5 Mac, 


for the MacCormack method. For this problem, the stability condition for the 
MacCormack method as derived from the convection term is significantly more 
restrictive than the condition derived from the diffusion term. 

Note that the peak contamination level at t = 2 is between four-and-a-half 
and five times the contamination released into the air. This build-up occurs be- 
cause the wind velocity initially decreases as we move away from the smoke stack. 
The extent of the build-up can be expected to decrease with either an increase in 
the diffusion coefficient or a reduction in the difference between the maximum and 
minimum wind velocity. Comparing the numerical methods, we see that numerical 
dispersion introduces oscillations behind the propagating contamination front into 
the MacCormack method solution. This complicates the determination of the pre- 
cise peak level of contamination. On the other hand, the numerical diffusion in the 


upwind scheme solution makes the localization of the front itself less precise than 
in the MacCormack method solution. 
Nonlinear Convection


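Suppose the convection-diffusion equation that we need to solve is of the form

$$\frac{\partial c}{\partial t} + \frac{\partial f(c)}{\partial x} = D\frac{\partial^2 c}{\partial x^2}; \tag{7}$$

that is, the convection effects are nonlinear. To implement upwind differencing for
this version of the convection-diffusion equation, we note, as in Section 11.1, that

$$\frac{\partial f(c)}{\partial x} = \frac{\partial f}{\partial c}\frac{\partial c}{\partial x}.$$

Thus, $\partial f/\partial c$ plays the role of the flow velocity. The analogues of equations (2)
and (3) for equation (7) are then

$$c_j^{(n+1)} = c_j^{(n)} + \frac{\Delta t}{\Delta x}\left[f_{j-1}^{(n)} - f_j^{(n)}\right] + \mu\left(c_{j-1}^{(n)} - 2c_j^{(n)} + c_{j+1}^{(n)}\right)$$

when $\partial f/\partial c > 0$ and

$$c_j^{(n+1)} = c_j^{(n)} + \frac{\Delta t}{\Delta x}\left[f_j^{(n)} - f_{j+1}^{(n)}\right] + \mu\left(c_{j-1}^{(n)} - 2c_j^{(n)} + c_{j+1}^{(n)}\right)$$

when $\partial f/\partial c < 0$. The finite difference equations for the MacCormack method
applied to (7) are

$$\text{Predictor:} \quad c_j^{*} = c_j^{(n)} + \frac{\Delta t}{\Delta x}\left[f_j^{(n)} - f_{j+1}^{(n)}\right] + \mu\left(c_{j-1}^{(n)} - 2c_j^{(n)} + c_{j+1}^{(n)}\right)$$

$$\text{Corrector:} \quad c_j^{(n+1)} = \frac{1}{2}\left\{c_j^{(n)} + c_j^{*} - \frac{\Delta t}{\Delta x}\left(f_j^{*} - f_{j-1}^{*}\right) + \mu\left(c_{j-1}^{*} - 2c_j^{*} + c_{j+1}^{*}\right)\right\}$$

Note that we only needed to replace each term of the form $(uc)_j^{(n)}$ from
equations (5) and (6) by $f_j^{(n)}$. In each of these finite difference equations (for both the
upwind scheme and the MacCormack method), we have used $f_j^{(n)}$ as a shorthand
for $f\left(c_j^{(n)}\right)$.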
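
In code, the only new ingredient for the nonlinear upwind scheme is the test on the sign of $\partial f/\partial c$ at each node. A minimal sketch follows; the name `upwind_step_nonlinear` and its callable arguments `f` and `dfdc` are illustrative, the end values are held fixed, and the case $\partial f/\partial c = 0$ (which the text does not address) is arbitrarily grouped with the forward-difference branch.

```python
def upwind_step_nonlinear(c, f, dfdc, D, dx, dt):
    """One upwind step for c_t + f(c)_x = D c_xx; end values are held fixed."""
    n = len(c)
    mu = D * dt / dx**2
    r = dt / dx
    fv = [f(v) for v in c]
    new = list(c)
    for j in range(1, n - 1):
        if dfdc(c[j]) > 0:
            conv = fv[j - 1] - fv[j]       # backward difference, flow to the right
        else:
            conv = fv[j] - fv[j + 1]       # forward difference (assumed for zero speed too)
        new[j] = c[j] + r * conv + mu * (c[j - 1] - 2 * c[j] + c[j + 1])
    return new

# Sanity check with linear flux f(c) = c, no diffusion and dt/dx = 1:
# the interior profile should shift exactly one cell to the right.
shifted = upwind_step_nonlinear([0, 1, 2, 3, 4], lambda v: v, lambda v: 1,
                                D=0.0, dx=1.0, dt=1.0)
print(shifted)
```

The shift test is a cheap way to confirm that the differencing direction matches the sign of the wave speed before moving on to a genuinely nonlinear flux.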


Application Problem 2: Stuck Behind a Slow-Moving Truck 


In the Chapter 11 Overview (see page 885), we considered the problem of traffic
flow along a highway and developed the model

$$\frac{\partial \rho}{\partial t} + \frac{\partial f(\rho)}{\partial x} = D\frac{\partial^2 \rho}{\partial x^2}$$

for the evolution of the vehicle density profile, $\rho(x,t)$. Here, $f(\rho) = v(\rho)\rho$, $v(\rho)$ is
the density-dependent traffic velocity and $D$ is a constant. In the previous section,
we worked with the simplified advection equation model corresponding to $D = 0$;
we will now treat the full convection-diffusion model.
As in Section 11.2, we take

$$\rho(x,0) = \begin{cases} 40, & x < -1 \\ 100 + 60\cos(\pi x), & -1 \le x \le 1 \\ 40, & x > 1 \end{cases}$$

as an approximation to the density profile surrounding a slow-moving truck that is
located at $x = 0$ and that exits the highway at $t = 0$. For simplicity, we assume
this initial condition extends out to infinity in both directions. Further, we will
continue to use

$$f(\rho) = \rho V_{\max}\left[1 - 0.7\frac{\rho}{\rho_{\max}} - 0.3\left(\frac{\rho}{\rho_{\max}}\right)^2\right],$$

where $V_{\max} = 50$ mph is the speed limit on the road and $\rho_{\max} = 400$ cars/mile.
Finally, let $D = 0.114$ miles²/hour.

With $\Delta x = 0.01$ miles, the maximum allowable time step for the upwind
scheme is

$$\Delta t_{\text{up}} \le \left(\frac{V_{\max}}{\Delta x} + \frac{2D}{(\Delta x)^2}\right)^{-1} = \left(\frac{50}{0.01} + \frac{0.228}{0.0001}\right)^{-1} = \frac{1}{7280}.$$

For convenience, $\Delta t_{\text{up}} = 1/7300$ hours is used. $\Delta t_{\text{Mac}} = \Delta x/V_{\max} = 2 \times 10^{-4}$
hours is selected for the MacCormack method. Figure 11.13 displays the computed
density profiles 0.6 minutes, 1.8 minutes, 3.0 minutes, and 4.2 minutes after the
truck has exited the highway.

There are three observations that can be made from this figure. First, there 
was not enough diffusion present to prevent the formation of a shock. Second, com- 
paring the bottom graph in Figure 11.13 with Figure 11.8, we see that the presence 
of diffusion has reduced the effects of dispersion in the MacCormack method so- 
lutions. Third, the MacCormack method maintains a steeper profile on the jump 
discontinuity and causes less rounding of the peak in the profile than does the 
upwind scheme. 
EXERCISES
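1. Show that the modified equation associated with the finite difference equation

$$c_j^{(n+1)} = c_j^{(n)} + \lambda\left(c_{j-1}^{(n)} - c_j^{(n)}\right) + \mu\left(c_{j-1}^{(n)} - 2c_j^{(n)} + c_{j+1}^{(n)}\right)$$

is

$$\frac{\partial c}{\partial t} + u\frac{\partial c}{\partial x} = \left[D + \frac{1}{2}u\Delta x(1 - \lambda)\right]\frac{\partial^2 c}{\partial x^2} + \cdots$$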
Figure 11.13 Vehicle density profiles after a slow moving truck exits
the highway. 


2. Derive the modified equation for the MacCormack method. 


3. Use both the upwind scheme and the MacCormack method to approximate the 
solution of the partial differential equation 


dc, A(c?/2) Ae 
Be pe eee 


subject to the initial condition 


$$c(x,0) = \begin{cases} 1 - \cos(2\pi x), & 0 \le x \le 1 \\ 0, & \text{elsewhere} \end{cases}$$

at $t = 0.1$, $t = 0.2$, $t = 0.3$, and $t = 0.4$. Remember to truncate the domain and
introduce appropriate numerical boundary conditions.

4. Use both the upwind scheme and the MacCormack method to approximate the 
solution of the initial boundary value problem 


2 2 

i + aed = 01s, e(z,0) =1+cos(rz), c(—1,t) =c(1,t) =0 
at $t = 1$, $t = 2$, $t = 3$, and $t = 4$.

5. Use both the upwind scheme and the MacCormack method to approximate the 

solution of the initial boundary value problem 


dc  A(ac) 16° 


a ceo peep ,t>0, 
ar Ox 2 Ox?’ Boe 


$$c(x,0) = \begin{cases} 1 - \cos(2\pi x), & 0 \le x \le 1 \\ 0, & \text{elsewhere,} \end{cases} \qquad c(0,t) = 0$$

at $t = 0.5$, $t = 1$, $t = 1.5$, and $t = 2$. Remember to truncate the domain and
introduce appropriate numerical boundary conditions.


6. In the “Stuck Behind a Slow-Moving Truck” application problem, we found that
a diffusion coefficient of $D = 0.114$ miles²/hour was not sufficient to prevent the
formation of a shock. Estimate how large the diffusion coefficient must be to
prevent the formation of a shock, at least up to $t = 4.2$ minutes.


7. Rework the “Stuck Behind a Slow-Moving Truck” application problem for each
of the following initial vehicle density profiles.

(a) $\rho(x,0) = \begin{cases} 160, & x < 0 \\ 100 + 60\cos(\pi x), & 0 \le x \le 1 \\ 40, & x > 1 \end{cases}$
70, 2<o 

(b) (2,0) = 4 100-7 cos(rz), O<a<2 
40, 2 
160, “<0 

(c) p(x,0) =< 20~20cos(rz/2), 0O<2r< 2 
40, z>2 


8. Rework the “Passive Contaminant Transport” application problem replacing the


boundary condition c(0,t) = 1 with ¢c(0,t) =1—- 22 Does the MacCormack 
method still produce oscillations trailing behind the moving front? 


9. Reconsider the “Passive Contaminant Transport” application problem. Estimate


the peak contamination level and the location of the contamination front after 30 
minutes, 1 hour and 2 hours for the following combinations of diffusion coefficient 
and wind velocity. 


(a) D=1.3x 107% miles”/hour, u = 2.5[1 + e ?/° cos(1.7m2/5)] miles/hour 
(b) D=1.3 x 1074 miles”/hour, wu = 2[1 + 0.8e72/8 cos(1.77x/5)| miles/hour 
(c) D=1.3 x 1073 miles?/hour, u = 21 + 0.8e7 2/5 cos(1.77z/5)| miles/hour 
10. Repeat Exercise 9 replacing the boundary condition c(0,t) = 1 with c(0,t) =
1-—e 2. 
11. Consider the convection-diffusion equation

$$\frac{\partial c}{\partial t} + \frac{\partial (uc)}{\partial x} = D\frac{\partial^2 c}{\partial x^2}.$$

(a) Develop the forward in time, central in space (FTCS) method for this
problem.

(b) Assuming that $u$ is constant, determine the stability condition associated
with the FTCS method.

(c) Determine the modified equation associated with the FTCS method.

(d) Approximate the solution of the convection-diffusion equation

$$\frac{\partial c}{\partial t} + 0.1\frac{\partial c}{\partial x} = D\frac{\partial^2 c}{\partial x^2}, \qquad c(0,t) = 1, \quad c(2,t) = 0, \quad c(x,0) = 0$$

at $t = 4$ for $D = 1/40$ and $D = 1/400$. Compare the solution with that of
the upwind scheme and the MacCormack method.


12. Consider the convection-diffusion equation 


$$\frac{\partial c}{\partial t} + \frac{\partial f(c)}{\partial x} = D\frac{\partial^2 c}{\partial x^2}.$$

(a) Develop the FTCS method for this problem. 


(b) Use the results of part (a) to rework Exercise 3. How does the performance 
of the FTCS method compare with that of the upwind scheme and the 
MacCormack method? 


11.4 THE WAVE EQUATION 


To finish this chapter, we will develop a finite difference method for the second-order
hyperbolic partial differential equation

$$\frac{\partial^2 u}{\partial t^2} = c^2\frac{\partial^2 u}{\partial x^2}.$$

This is known as the wave equation and models such diverse phenomena as the 
vibration of a piano or guitar string, the oscillation of pressure within the pipe of an 
organ or within a blood vessel and the propagation of a signal along a transmission 
line. Being second order in both the time and the space variable, the wave equation 
requires two initial conditions and two boundary conditions. The initial conditions 
take the form 


$$u(x,t_0) = f(x) \quad\text{and}\quad \frac{\partial u}{\partial t}(x,t_0) = g(x).$$

In general, the wave equation can be defined over the entire real line (an infinite
interval), a semi-infinite interval or a finite interval, and a variety of different
boundary conditions can be specified. Here, we will restrict attention to the finite interval
$A \le x \le B$ with Dirichlet boundary conditions

$$u(A,t) = \alpha(t) \quad\text{and}\quad u(B,t) = \beta(t).$$


Generalizations to the differential equation and to the boundary conditions will be 
considered in the exercises. 


The Method 


Let $x_j = A + j\Delta x$ for $j = 0, 1, 2, \ldots, N$ for some positive integer $N$, where
$\Delta x = (B - A)/N$. Further, let $t_n = t_0 + n\Delta t$ for some uniform time step $\Delta t$.
Evaluate the wave equation at the arbitrary grid point $(x_j, t_n)$. Next, replace
each derivative by its second-order central difference approximation and drop the
truncation error terms. This yields

$$\frac{w_j^{(n+1)} - 2w_j^{(n)} + w_j^{(n-1)}}{(\Delta t)^2} = c^2\,\frac{w_{j-1}^{(n)} - 2w_j^{(n)} + w_{j+1}^{(n)}}{(\Delta x)^2},$$

where $w_j^{(n)} \approx u(x_j, t_n)$. Solving for $w_j^{(n+1)}$, we arrive at the explicit finite difference
equation

$$w_j^{(n+1)} = \lambda w_{j-1}^{(n)} + 2(1 - \lambda)w_j^{(n)} + \lambda w_{j+1}^{(n)} - w_j^{(n-1)}, \tag{1}$$

where $\lambda = (c\Delta t/\Delta x)^2$. Equation (1) is a three time-level finite difference equation:
values of the approximate solution at $t = t_{n+1}$ are determined by the values at both
$t = t_n$ and $t = t_{n-1}$. Hence, this equation only applies for $n \ge 1$.

So how do we compute the approximate solution at $t = t_1$? We can handle this
situation in precisely the same manner with which we have treated non-Dirichlet
boundary conditions in the previous three chapters: use fictitious nodes. Evaluating
equation (1) at $n = 0$ gives

$$w_j^{(1)} = \lambda w_{j-1}^{(0)} + 2(1 - \lambda)w_j^{(0)} + \lambda w_{j+1}^{(0)} - w_j^{(-1)}, \tag{2}$$

where $w_j^{(-1)}$ is associated with a fictitious node. To eliminate this term, evaluate
the derivative initial condition at $(x_j, t_0)$ and replace $\partial u/\partial t$ by its second-order
central difference approximation:

$$\frac{w_j^{(1)} - w_j^{(-1)}}{2\Delta t} = g_j \quad\Rightarrow\quad w_j^{(-1)} = w_j^{(1)} - 2\Delta t\, g_j. \tag{3}$$

Substituting equation (3) into equation (2) and solving for $w_j^{(1)}$ yields

$$w_j^{(1)} = \frac{1}{2}\lambda w_{j-1}^{(0)} + (1 - \lambda)w_j^{(0)} + \frac{1}{2}\lambda w_{j+1}^{(0)} + \Delta t\, g_j.$$

Finally, taking into account the initial condition on $u$, we have

$$w_j^{(1)} = \frac{1}{2}\lambda f_{j-1} + (1 - \lambda)f_j + \frac{1}{2}\lambda f_{j+1} + \Delta t\, g_j. \tag{4}$$

To summarize, the initial condition $u(x, t_0) = f(x)$ provides the values of the
approximate solution to the wave equation at $t = t_0$. Next, the values at $t = t_1$ are
computed from equation (4). Equation (1) is then used to advance the approximate
solution to $t = t_2$, $t = t_3$, $t = t_4$ and so on. The resulting approximation is
second-order accurate in both $\Delta t$ and $\Delta x$.
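
The three stages summarized above (initial values, special first step, general step) can be sketched in Python as follows. The function name `solve_wave`, its signature, and the use of plain lists are assumptions made for illustration; the usage lines take their data from a test problem with exact solution $u(x,t) = \sin(x - t/5)$, chosen here only to exercise the code.

```python
import math

def solve_wave(f, g, alpha, beta, c, A, B, N, dt, steps, t0=0.0):
    """Advance u_tt = c^2 u_xx by `steps` >= 1 time steps; return (x, w)."""
    dx = (B - A) / N
    lam = (c * dt / dx) ** 2
    x = [A + j * dx for j in range(N + 1)]
    prev = [f(xj) for xj in x]                       # values at t0
    # First step: equation (4), which uses the derivative initial condition g.
    curr = [0.0] * (N + 1)
    for j in range(1, N):
        curr[j] = (0.5 * lam * (prev[j - 1] + prev[j + 1])
                   + (1 - lam) * prev[j] + dt * g(x[j]))
    curr[0], curr[N] = alpha(t0 + dt), beta(t0 + dt)
    # Remaining steps: the three time-level equation (1).
    for n in range(1, steps):
        t_next = t0 + (n + 1) * dt
        nxt = [0.0] * (N + 1)
        for j in range(1, N):
            nxt[j] = (lam * (curr[j - 1] + curr[j + 1])
                      + 2 * (1 - lam) * curr[j] - prev[j])
        nxt[0], nxt[N] = alpha(t_next), beta(t_next)
        prev, curr = curr, nxt
    return x, curr

# Usage: c = 1/5 on [0, 1] with dx = 1/5 and dt = 1 (so lambda = 1), two steps.
x, w = solve_wave(math.sin, lambda s: -math.cos(s) / 5,
                  lambda t: -math.sin(t / 5), lambda t: math.sin(1 - t / 5),
                  c=0.2, A=0.0, B=1.0, N=5, dt=1.0, steps=2)
for xj, wj in zip(x, w):
    print(f"{xj:.1f}  {wj: .6f}  {math.sin(xj - 0.4): .6f}")
```

Only two time levels need to be kept in memory at any moment, which is why the loop simply rotates `prev` and `curr` rather than storing the whole history.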


Stability Analysis 


When we use the numerical method that has just been developed, and select $\Delta x$
to resolve the anticipated spatial variations in the solution, what restriction, if
any, is placed on $\Delta t$? Let's answer this question by performing a von Neumann
stability analysis. Substitution of the discrete Fourier mode, $w_j^{(n)} = r^n e^{i(j\theta)}$, into
equation (1) leads to

$$r^{n+1}e^{i(j\theta)} = r^n e^{i(j\theta)}\left[\lambda e^{-i\theta} + 2(1 - \lambda) + \lambda e^{i\theta}\right] - r^{n-1}e^{i(j\theta)}. \tag{5}$$

Dividing equation (5) by $r^{n-1}e^{i(j\theta)}$, using the identity $e^{i\theta} + e^{-i\theta} = 2\cos\theta$, and
transposing all terms to the left-hand side of the equation, we find that the
amplification factor satisfies the quadratic equation

$$r^2 - 2\left[(1 - \lambda) + \lambda\cos\theta\right]r + 1 = 0. \tag{6}$$

For the numerical method to be stable, both roots of this quadratic must be less
than or equal to one in magnitude.

The first thing to note about equation (6) is that the leading coefficient and
the constant term are both 1. This implies that the product of the roots is 1. If
these roots are real and distinct, then one of them must be strictly larger than one
in magnitude, and the method will be unstable. Thus, for the method to be stable,
(6) must have either a double real root or complex conjugate roots. In other words,
the discriminant,

$$4\left[(1 - \lambda) + \lambda\cos\theta\right]^2 - 4,$$

must be less than or equal to zero. This will happen when

$$\left[(1 - \lambda) + \lambda\cos\theta\right]^2 \le 1,$$

or

$$-1 \le (1 - \lambda) + \lambda\cos\theta \le 1.$$

The inequality on the right holds for all $\lambda$, but the inequality on the left requires
$\lambda \le 1$. Hence, our numerical method for the wave equation is conditionally stable
and requires $\Delta t \le \Delta x/c$.
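
The root argument can be confirmed numerically by solving the quadratic (6) for sampled values of $\theta$. In this sketch the helper names `root_magnitudes` and `is_stable`, the sample count, and the tolerance are illustrative assumptions.

```python
import cmath
import math

def root_magnitudes(lam, theta):
    """Magnitudes of the two roots of r^2 - 2*b*r + 1 = 0, b = (1-lam) + lam*cos(theta)."""
    b = (1 - lam) + lam * math.cos(theta)
    disc = cmath.sqrt(b * b - 1)           # complex sqrt handles both root types
    return abs(b + disc), abs(b - disc)

def is_stable(lam, samples=361, tol=1e-9):
    """True if no sampled theta in [0, pi] gives a root with magnitude above 1."""
    thetas = [math.pi * k / (samples - 1) for k in range(samples)]
    return all(max(root_magnitudes(lam, th)) <= 1 + tol for th in thetas)

for lam in (0.5, 1.0, 1.1):
    print(lam, is_stable(lam))
```

For $\lambda \le 1$ every sampled pair of roots lies on the unit circle, while for $\lambda$ slightly above 1 the mode $\theta = \pi$ already produces a real root larger than one in magnitude.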


Worked Examples 


EXAMPLE 11.7 Our Numerical Method in Action
Consider the wave equation

$$\frac{\partial^2 u}{\partial t^2} = \frac{1}{25}\frac{\partial^2 u}{\partial x^2}, \qquad 0 < x < 1, \ t > 0$$

with boundary and initial conditions given by

$$u(0,t) = -\sin(t/5), \qquad u(1,t) = \sin(1 - t/5),$$

$$u(x,0) = \sin x, \qquad \frac{\partial u}{\partial t}(x,0) = -\frac{1}{5}\cos x.$$

Comparing the current problem with the model wave equation, we find that $c^2 =
1/25$, or $c = 1/5$. For all of our calculations, we will choose $\lambda = 1$; that is, we will
use the maximum time step allowed by stability. Given that $c = 1/5$, this means
that $\Delta t$ and $\Delta x$ must satisfy the relation $\Delta t = 5\Delta x$.
Let's start with $\Delta x = 1/5$ and $\Delta t = 1$ and advance the solution from $t = 0$
to $t = 2$. Using the value of the initial condition $u(x, 0) = \sin x$ at the interior grid
points and the boundary conditions at the boundary grid points, it follows that

$$w^{(0)} = \left[\, 0 \quad \sin 0.2 \quad \sin 0.4 \quad \sin 0.6 \quad \sin 0.8 \quad \sin 1 \,\right]^T = \left[\, 0 \quad 0.198669 \quad 0.389418 \quad 0.564642 \quad 0.717356 \quad 0.841471 \,\right]^T.$$

Values have been displayed to six significant figures for convenience. With $\lambda = 1$
and $\Delta t = 1$, equation (4) becomes

$$w_j^{(1)} = \frac{1}{2}\left(f_{j-1} + f_{j+1}\right) + g_j.$$

Thus, for $j = 1$, 2, 3, and 4,

$$w_j^{(1)} = \frac{1}{2}\left(\sin x_{j-1} + \sin x_{j+1}\right) - \frac{1}{5}\cos x_j.$$

The boundary conditions supply $w_0^{(1)} = -\sin 0.2$ and $w_5^{(1)} = \sin 0.8$. Therefore,
after the first time step

$$w^{(1)} = \left[\, -0.198669 \quad -0.00130414 \quad 0.197444 \quad 0.388320 \quad 0.563715 \quad 0.717356 \,\right]^T.$$
For the second time step, note that with $\lambda = 1$, equation (1) becomes

$$w_j^{(n+1)} = w_{j-1}^{(n)} + w_{j+1}^{(n)} - w_j^{(n-1)}.$$

Applying this equation for $j = 1$, 2, 3, and 4 and obtaining $w_0^{(2)} = -\sin 0.4$ and
$w_5^{(2)} = \sin 0.6$ from the boundary conditions yields

$$w^{(2)} = \left[\, -0.389418 \quad -0.199895 \quad -0.00240239 \quad 0.196516 \quad 0.388320 \quad 0.564642 \,\right]^T.$$


The following table compares the values of the approximate solution with the values
of the exact solution, $u(x,t) = \sin(x - t/5)$, at $t = 2$.


x_j     w_j^(2)        u(x_j, 2)     Absolute Error

0.0    -0.389418      -0.389418      0.000000
0.2    -0.199895      -0.198669      0.001226
0.4    -0.00240239     0.000000      0.002402
0.6     0.196516       0.198669      0.002153
0.8     0.388320       0.389418      0.001098
1.0     0.564642       0.564642      0.000000


A numerical verification of the second-order accuracy of the scheme is
presented in the next table. Both the maximum absolute error and the root
mean-square (rms) error in the approximate solution at $t = 1$, as a function of the number
of subintervals along the $x$-axis, NX, and the number of time steps taken to reach
$t = 1$, NT, are displayed. Clearly, each time $\Delta t$ and $\Delta x$ are cut in half, the
corresponding error drops by a factor of 4.

NT   NX   Maximum Error   Error Ratio   rms Error      Error Ratio
 1    5   0.0013041444                  0.0010265081
 2   10   0.0003248943    4.014058      0.0002482593   4.134823
 4   20   0.0000811525    4.003504      0.0000615365   4.034345
 8   40   0.0000202837    4.000875      0.0000153510   4.008628
16   80   0.0000050706    4.000219      0.0000038357   4.002160
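
The maximum-error column can be regenerated with a short script. This sketch hard-codes the data of Example 11.7 with $\lambda = 1$ (so $\Delta t = 5\Delta x$), in which case equations (4) and (1) reduce to the simplified forms used in the example; the function name `max_error_at_t1` is an illustrative choice.

```python
import math

def max_error_at_t1(nx):
    """Max error at t = 1 for Example 11.7 with lambda = 1; nx must be a multiple of 5."""
    dx, dt = 1.0 / nx, 5.0 / nx
    nt = nx // 5                                   # number of time steps to reach t = 1
    x = [j * dx for j in range(nx + 1)]

    def exact(xj, t):
        return math.sin(xj - t / 5)

    prev = [math.sin(xj) for xj in x]              # t = 0
    # First step: w_j = (f_{j-1} + f_{j+1})/2 + dt * g_j with g = -cos(x)/5.
    curr = ([exact(x[0], dt)]
            + [0.5 * (prev[j - 1] + prev[j + 1]) - dt * math.cos(x[j]) / 5
               for j in range(1, nx)]
            + [exact(x[nx], dt)])
    for n in range(1, nt):
        t = (n + 1) * dt
        nxt = ([exact(x[0], t)]
               + [curr[j - 1] + curr[j + 1] - prev[j] for j in range(1, nx)]
               + [exact(x[nx], t)])
        prev, curr = curr, nxt
    return max(abs(curr[j] - exact(x[j], 1.0)) for j in range(nx + 1))

errors = [max_error_at_t1(nx) for nx in (5, 10, 20, 40, 80)]
for coarse, fine in zip(errors, errors[1:]):
    print(f"{fine:.10f}  ratio {coarse / fine:.6f}")
```

Error ratios that settle near 4 under successive halving of both step sizes are the numerical signature of second-order convergence.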


EXAMPLE 11.8 Better than Expected Performance

As a second example, consider the wave equation

$$\frac{\partial^2 u}{\partial t^2} = \frac{\partial^2 u}{\partial x^2}, \qquad 0 < x < 1, \ t > 0,$$

subject to the boundary conditions $u(0,t) = u(1,t) = 0$ and the initial conditions

$$u(x,0) = \sin \pi x, \qquad \frac{\partial u}{\partial t}(x,0) = 0.$$

The two tables given below compare the approximate solution, computed with
$\Delta x = \Delta t = 1/10$, to the exact solution, $u(x,t) = \sin \pi x \cos \pi t$, for two different
values of $t$. In each case, the absolute error in the approximate solution is much
smaller than one would expect from a second-order method. In fact, it appears as
if we have obtained the exact solution, to within roundoff errors. That we have,
indeed, obtained the exact solution is established following the end of this example.


               t = 2 (20 time steps)                        t = 3.2 (32 time steps)
x_j    w_j^(20)   u(x_j, 2)   Absolute Error       w_j^(32)    u(x_j, 3.2)   Absolute Error
0.0    0.000000    0.000000   0.000000              0.000000   -0.000000     0.000000
0.1    0.309017    0.309017   5.551115 x 10^-17    -0.250000   -0.250000     2.775558 x 10^-17
0.2    0.587785    0.587785   0.000000             -0.475528   -0.475528     7.771561 x 10^-16
0.3    0.809017    0.809017   0.000000             -0.654508   -0.654508     0.000000
0.4    0.951057    0.951057   3.330669 x 10^-16    -0.769421   -0.769421     7.771561 x 10^-16
0.5    1.000000    1.000000   0.000000             -0.809017   -0.809017     0.000000
0.6    0.951057    0.951057   3.330669 x 10^-16    -0.769421   -0.769421     7.771561 x 10^-16
0.7    0.809017    0.809017   0.000000             -0.654508   -0.654508     0.000000
0.8    0.587785    0.587785   0.000000             -0.475528   -0.475528     7.771561 x 10^-16
0.9    0.309017    0.309017   5.551115 x 10^-17    -0.250000   -0.250000     2.775558 x 10^-17
1.0    0.000000   -0.000000   0.000000              0.000000    0.000000     0.000000


How can we explain the performance in this second example? First, consider 
equation (1). If the exact solution of the wave equation is infinitely differentiable, 
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then the truncation error associated with equation (1) is given by 


(on Ot yl ae o) Se du. (Ax)* _) 


@ ae ° 6 O28 
ae au —-y (Az)® _ a 


4. Ot ° (OSC 


OS BIC 
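Using the wave equation, we can establish that

$$\frac{\partial^4 u}{\partial t^4} = \frac{\partial^2}{\partial t^2}\left(\frac{\partial^2 u}{\partial t^2}\right) = \frac{\partial^2}{\partial t^2}\left(c^2\frac{\partial^2 u}{\partial x^2}\right) = c^2\frac{\partial^2}{\partial x^2}\left(\frac{\partial^2 u}{\partial t^2}\right) = c^4\frac{\partial^4 u}{\partial x^4}.$$

In a similar manner, it can be shown that

$$\frac{\partial^{2n} u}{\partial t^{2n}} = c^{2n}\frac{\partial^{2n} u}{\partial x^{2n}}.$$
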
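Converting all of the t-derivatives in equation (7) to x-derivatives yields

$$\frac{2c^2(\Delta x)^2}{4!}\left(\lambda^2 - 1\right)\frac{\partial^4 u}{\partial x^4} + \frac{2c^2(\Delta x)^4}{6!}\left(\lambda^4 - 1\right)\frac{\partial^6 u}{\partial x^6} + \frac{2c^2(\Delta x)^6}{8!}\left(\lambda^6 - 1\right)\frac{\partial^8 u}{\partial x^8} + \cdots,$$

where $\lambda = c\Delta t/\Delta x$. Thus, with $\lambda = 1$, the truncation error associated with the second, third, fourth,
and so on time steps is identically zero.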

What about the error associated with the first time step; that is, with
equation (4)? Recall that when $\lambda = 1$, equation (4) reduces to

$$w_j^{(1)} = \frac{1}{2}\left(f_{j-1} + f_{j+1}\right) + \Delta t\, g_j. \tag{8}$$

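This is to be compared with the d'Alembert solution to the wave equation

$$u(x,t) = \frac{1}{2}\left[f(x - ct) + f(x + ct)\right] + \frac{1}{2c}\int_{x-ct}^{x+ct} g(\xi)\,d\xi. \tag{9}$$

Evaluating (9) at $x = x_j$ and $t = \Delta t$ yields

$$u(x_j, \Delta t) = \frac{1}{2}\left(f_{j-1} + f_{j+1}\right) + \frac{\Delta t}{2\Delta x}\int_{x_{j-1}}^{x_{j+1}} g(\xi)\,d\xi, \tag{10}$$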


where we have made repeated use of the relation $c\Delta t = \Delta x$ (which follows from
$\lambda = 1$). Note that equation (8) can be viewed as an approximation to (10) obtained
by estimating the value of the integral in (10) with the midpoint rule. The first
time step will therefore introduce some error unless the midpoint rule integrates g
exactly, which will happen when g is constant or linear.
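
The midpoint-rule connection can be written out in one line. The following is a sketch in the notation of (8) and (10), taking $g_j = g(x_j)$ and assuming g is twice continuously differentiable so that the standard midpoint-rule error term applies; the point $\eta$ is some unspecified point in $(x_{j-1}, x_{j+1})$. The interval of integration has width $2\Delta x$ and midpoint $x_j$, so

$$\frac{\Delta t}{2\Delta x}\int_{x_{j-1}}^{x_{j+1}} g(\xi)\,d\xi = \frac{\Delta t}{2\Delta x}\left[2\Delta x\, g(x_j) + \frac{(2\Delta x)^3}{24}\, g''(\eta)\right] = \Delta t\, g_j + \frac{\Delta t\,(\Delta x)^2}{6}\, g''(\eta).$$

The first term is the last term of (8); the second term vanishes when $g'' \equiv 0$, that is, when g is constant or linear.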

Putting all of this together, we can expect our numerical method to produce 
the exact solution when 


(1) the exact solution is infinitely differentiable;
(2) calculations are made with $\lambda = 1$; and
(3) g is constant or linear.


It is easy to verify that the problem in the second example satisfies all three condi- 
tions. Thus the only errors in the second example are roundoff errors. 
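
These conditions can be checked numerically. The sketch below assumes the standard explicit central-difference update for the interior nodes and a second-order first step that reduces to (8) when $\lambda = 1$; the names `solve_wave` and `max_error` are illustrative and do not come from the text. It solves the second example with $\lambda = 1$ and again with $\lambda = 1/2$.

```python
import math

def solve_wave(f, g, c, nx, dt, nsteps, length=1.0):
    """Explicit finite differences for u_tt = c^2 u_xx on [0, length] with
    u = 0 at both ends, u(x,0) = f(x) and u_t(x,0) = g(x).
    Assumed scheme: central differences in space and time; the first step
    is a second-order start that reduces to equation (8) when lambda = 1."""
    dx = length / nx
    lam2 = (c * dt / dx) ** 2                 # lambda squared
    x = [j * dx for j in range(nx + 1)]
    prev = [f(xj) for xj in x]
    prev[0] = prev[nx] = 0.0                  # homogeneous Dirichlet ends
    curr = [0.0] * (nx + 1)
    for j in range(1, nx):                    # first time step
        curr[j] = ((1.0 - lam2) * prev[j]
                   + 0.5 * lam2 * (prev[j - 1] + prev[j + 1])
                   + dt * g(x[j]))
    for _ in range(1, nsteps):                # remaining time steps
        nxt = [0.0] * (nx + 1)
        for j in range(1, nx):
            nxt[j] = (2.0 * (1.0 - lam2) * curr[j]
                      + lam2 * (curr[j + 1] + curr[j - 1]) - prev[j])
        prev, curr = curr, nxt
    return x, curr

def max_error(nx, dt, nsteps):
    """Maximum absolute error against u(x,t) = sin(pi x) cos(pi t)."""
    x, w = solve_wave(lambda s: math.sin(math.pi * s), lambda s: 0.0,
                      1.0, nx, dt, nsteps)
    t = nsteps * dt
    return max(abs(wj - math.sin(math.pi * xj) * math.cos(math.pi * t))
               for xj, wj in zip(x, w))

err_lam_1 = max_error(10, 0.1, 20)       # lambda = 1,   t = 2
err_lam_half = max_error(10, 0.05, 40)   # lambda = 1/2, t = 2
print(err_lam_1, err_lam_half)
```

With $\lambda = 1$ the reported error should sit at roundoff level, while halving the time step (so that $\lambda = 1/2$) brings back an ordinary second-order truncation error.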
An Application Problem: Transmission Line


Consider a 100-meter-long transmission line with inductance per unit length L = 1.2
henries/meter and capacitance per unit length C = 0.3 farads/meter. Let V(x,t)
denote the voltage along the line. If the resistance of the cable and current leakage
are negligible, then, from the Chapter 11 Overview (see page 885), we know that
V(x,t) satisfies the wave equation

$$\frac{\partial^2 V}{\partial t^2} = \frac{1}{LC}\frac{\partial^2 V}{\partial x^2}.$$

Suppose that the line is originally dead; that is,

$$V(x,0) = \frac{\partial V}{\partial t}(x,0) = 0,$$

and the right end of the line is grounded, V(100,t) = 0. At the left end of the line,
the following voltage is impressed:

$$V(0,t) = \begin{cases} 110\sin(\pi t/50), & 0 < t < 100 \\ 0, & t > 100. \end{cases}$$

Figure 11.14 displays the voltage along the transmission line at t = 30, t = 60,
t = 90, and t = 120 seconds. All calculations were performed with $\Delta x = 1$ meter
and $\Delta t = 0.6$ seconds.
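
A calculation of this kind might be set up as in the following sketch. It assumes the standard explicit central-difference scheme with wave speed $c = 1/\sqrt{LC}$; the function names `impressed_voltage` and `transmission_line` are illustrative, not taken from the text. With the stated step sizes, $\lambda = c\Delta t/\Delta x = 0.6/\sqrt{0.36} = 1$.

```python
import math

def impressed_voltage(t):
    """Left-end voltage V(0, t) from the problem statement."""
    return 110.0 * math.sin(math.pi * t / 50.0) if t < 100.0 else 0.0

def transmission_line(nsteps, nx=100, dx=1.0, dt=0.6, L=1.2, C=0.3):
    """Voltage at the grid nodes after nsteps time steps, starting from a
    dead line with a grounded right end. Assumed scheme: explicit central
    differences with wave speed c = 1/sqrt(L*C)."""
    lam2 = dt * dt / (L * C * dx * dx)     # lambda squared
    prev = [0.0] * (nx + 1)                # t = 0: dead line
    curr = [0.0] * (nx + 1)                # t = dt: zero data, zero interior
    curr[0] = impressed_voltage(dt)
    for n in range(1, nsteps):
        nxt = [0.0] * (nx + 1)             # nxt[nx] stays 0 (grounded end)
        for j in range(1, nx):
            nxt[j] = (2.0 * (1.0 - lam2) * curr[j]
                      + lam2 * (curr[j + 1] + curr[j - 1]) - prev[j])
        nxt[0] = impressed_voltage((n + 1) * dt)
        prev, curr = curr, nxt
    return curr

v30 = transmission_line(50)    # 50 steps of 0.6 s = 30 s
print(v30[25], v30[75])        # voltage at x = 25 m and x = 75 m
```

At t = 30 seconds the signal has travelled $ct = 50$ meters, so the voltage at x = 75 meters is still zero, while the value at x = 25 meters is the boundary voltage delayed by $x/c = 15$ seconds.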


EXERCISES 


In Exercises 1-4, approximate the solution of the given wave equation. In each case, 
compare the approximate solution with the indicated exact solution and, where possible, 
numerically verify the second-order accuracy of the finite difference scheme. Explain 
any unusual findings. 


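1. $\dfrac{\partial^2 u}{\partial t^2} = 9\dfrac{\partial^2 u}{\partial x^2}$, $0 < x < 1$, $t > 0$, $u(0,t) = u(1,t) = 0$,
$u(x,0) = 0$, $\dfrac{\partial u}{\partial t}(x,0) = 3\pi \sin \pi x$; exact solution: $u(x,t) = \sin \pi x \sin 3\pi t$

2. $\dfrac{\partial^2 u}{\partial t^2} = 4\dfrac{\partial^2 u}{\partial x^2}$, $0 < x < 1$, $t > 0$, $u(0,t) = \cosh 2t$, $u(1,t) = e\cosh 2t + t$,
$u(x,0) = e^x$, $\dfrac{\partial u}{\partial t}(x,0) = x$; exact solution: $u(x,t) = e^x \cosh 2t + xt$

3. $\dfrac{\partial^2 u}{\partial t^2} = \dfrac{1}{4}\dfrac{\partial^2 u}{\partial x^2}$, $0 < x < 1$, $t > 0$, $u(0,t) = 2\sinh(t/2)$,
$u(1,t) = 2e\sinh(t/2) + 1$, $u(x,0) = x$, $\dfrac{\partial u}{\partial t}(x,0) = e^x$;
exact solution: $u(x,t) = 2e^x \sinh(t/2) + x$

4. $\dfrac{\partial^2 u}{\partial t^2} = \dfrac{\partial^2 u}{\partial x^2}$, $0 < x < 1$, $t > 0$, $u(0,t) = t^3/3$, $u(1,t) = t + t^3/3$,
$u(x,0) = 0$, $\dfrac{\partial u}{\partial t}(x,0) = x^2$; exact solution: $u(x,t) = x^2 t + t^3/3$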
[Figure 11.14: Voltage along the 100-meter transmission line after 30, 60, 90, and 120 seconds.]


5. Let y(x,t) denote the lateral displacement of a vibrating string. If T is the
tension in the string, w is the weight per unit length, and g is acceleration due
to gravity, then y satisfies the equation

$$\frac{\partial^2 y}{\partial t^2} = \frac{Tg}{w}\frac{\partial^2 y}{\partial x^2}.$$

Suppose a particular string is 6 feet long and is fixed at both ends. Taking
T = 32 pounds, w = 0.01 pounds/foot, and g = 32 feet/second², use the finite
difference method to estimate the period of oscillation of the string. The initial
conditions are

$$y(x,0) = \begin{cases} x/6, & 0 \le x \le 3 \\ (6 - x)/6, & 3 \le x \le 6 \end{cases} \qquad \text{and} \qquad \frac{\partial y}{\partial t}(x,0) = x(x - 6).$$

6. Consider the wave equation with damping term:

$$\frac{\partial^2 u}{\partial t^2} + \mu\frac{\partial u}{\partial t} = c^2\frac{\partial^2 u}{\partial x^2}, \qquad 0 < x < L,\ t > 0,$$

where $\mu$ and c are constant.

(a) Develop a second-order finite difference method for this equation. Assume
that Dirichlet boundary conditions are specified at both ends of the domain.

(b) Use the method developed in part (a) to solve the problem:

Consider a 100-meter-long transmission line with inductance per unit length
L = 1.2 henries/meter and capacitance per unit length C = 0.3 farads per
meter. Let V(x,t) denote the voltage along the line. If the resistance of the
cable is negligible but current leakage is assumed proportional to voltage,
then V(x,t) satisfies

$$\frac{\partial^2 V}{\partial t^2} + G\frac{\partial V}{\partial t} = \frac{1}{LC}\frac{\partial^2 V}{\partial x^2}.$$

Take G = 0.0038 (s)$^{-1}$ and suppose that the line is originally dead; that is,

$$V(x,0) = \frac{\partial V}{\partial t}(x,0) = 0.$$

The right end of the line is grounded, V(100,t) = 0. At the left end of the
line, the following voltage is impressed:

$$V(0,t) = \begin{cases} 110\sin(\pi t/50), & 0 < t < 100 \\ 0, & t > 100. \end{cases}$$

Determine the voltage after 30 seconds, 60 seconds, 90 seconds, and 120
seconds.


7. Consider the wave equation with source term:

$$\frac{\partial^2 u}{\partial t^2} = c^2\frac{\partial^2 u}{\partial x^2} + s(x,t), \qquad 0 < x < L,\ t > 0,$$

where c is constant.

(a) Develop a second-order finite difference method for this equation. Assume
that Dirichlet boundary conditions are specified at both ends of the domain.

(b) Use the method developed in part (a) to solve the problem:

Let y(x,t) denote the lateral displacement of a vibrating string. If T is the
tension in the string, w is the weight per unit length, g is acceleration due
to gravity, and F is an applied force per unit length, then y satisfies the
equation

$$\frac{\partial^2 y}{\partial t^2} = \frac{Tg}{w}\frac{\partial^2 y}{\partial x^2} + \frac{gF(x,t)}{w}.$$

Suppose a particular string is 6 feet long and is fixed at both ends. Taking
T = 32 pounds, w = 0.01 pounds/foot, g = 32 feet/second², and F(x,t) =
5 sin(12πt) pounds/foot, determine the profile of the string after 10 seconds,
20 seconds, and 30 seconds, assuming that the string starts from rest.


8. Consider the wave equation:

$$\frac{\partial^2 u}{\partial t^2} = c^2\frac{\partial^2 u}{\partial x^2}, \qquad 0 < x < L,\ t > 0,$$

where c is constant, subject to the boundary conditions

$$u(0,t) = \alpha(t), \qquad \frac{\partial u}{\partial x}(L,t) = \beta(t).$$

(a) Develop a second-order finite difference method for this equation.

(b) Use the method developed in part (a) to solve the problem:

Consider a 100-meter-long transmission line with inductance per unit length
L = 1.2 henries/meter and capacitance per unit length C = 0.3 farads per
meter. Let V(x,t) denote the voltage along the line. If the resistance of the
cable and current leakage are negligible, then V(x,t) satisfies

$$\frac{\partial^2 V}{\partial t^2} = \frac{1}{LC}\frac{\partial^2 V}{\partial x^2}.$$

Suppose that the line is originally dead; that is,

$$V(x,0) = \frac{\partial V}{\partial t}(x,0) = 0.$$

The right end of the line is open, so that $\partial V/\partial x = 0$. At the left end of the
line, the following voltage is impressed:

$$V(0,t) = \begin{cases} 110\sin(\pi t/50), & 0 < t < 100 \\ 0, & t > 100. \end{cases}$$

Determine the voltage after 30 seconds, 60 seconds, 90 seconds, and 120
seconds.

(c) Use the method developed in part (a) to solve the problem:

A 1-meter-long organ pipe is closed at the top end. The pressure, p(x,t), in
the pipe satisfies

a*p 7 8 


at? Ox?’ 
subject to the initial conditions 


pla, 0) = patm cos(32), oe (x,0) =0 


and the boundary conditions

$$p(0,t) = p_{\text{atm}}, \qquad \frac{\partial p}{\partial x}(1,t) = 0.$$

Take $p_{\text{atm}} = 1.05$, and determine the pressure at t = 5, t = 10, and t = 15.


APPENDIX A 


Important Theorems from 
Calculus 


The following theorems play an important role in the development and the analysis
of many of the numerical methods presented in this text. Proofs for these theorems
can be found in most elementary calculus or real analysis/advanced calculus
textbooks.


Rolle’s Theorem 
If f is continuous on [a,b] and differentiable on (a,b) with f(a) = f(b) = 0, then
there exists a number c ∈ (a,b) with f′(c) = 0.


Generalized Rolle’s Theorem

If f is continuous on [a,b], has n continuous derivatives on (a,b) and is equal to
zero at n + 1 distinct points in [a,b], then there exists a number c ∈ (a,b) with
$f^{(n)}(c) = 0$.


Mean Value Theorem 
If f is continuous on [a,b] and differentiable on (a,b), then there exists a number
c ∈ (a,b) such that

$$f'(c) = \frac{f(b) - f(a)}{b - a}.$$


Weighted Mean Value Theorem for Integrals
If f is continuous on [a,b], g is integrable on [a,b] and g(x) does not change sign
on [a,b], then there exists a number c ∈ (a,b) such that

$$\int_a^b f(x)g(x)\,dx = f(c)\int_a^b g(x)\,dx.$$


Extreme Value Theorem 
If f is continuous on [a,b], then there exist numbers $c_1, c_2 \in [a,b]$ with $f(c_1) \le f(x) \le f(c_2)$
for all x ∈ [a,b].

NOTE: If f is differentiable on (a,b), then $c_1$ and $c_2$ occur either at the
endpoints of [a,b] or where f′(c) = 0.


Intermediate Value Theorem


If f is continuous on [a,b] and k is ANY number between f(a) and f(b), then there
exists a number c ∈ (a,b) with f(c) = k.


Taylor’s Theorem

Suppose f is continuous on [a,b], has n continuous derivatives on (a,b) and $f^{(n+1)}$
exists on [a,b]. Let $x_0 \in [a,b]$. For every x ∈ [a,b] there exists $\xi(x)$ between x and
$x_0$ such that

$$f(x) = P_n(x) + R_n(x),$$

where

$$P_n(x) = \sum_{k=0}^{n} \frac{f^{(k)}(x_0)}{k!}(x - x_0)^k$$

and

$$R_n(x) = \frac{f^{(n+1)}(\xi(x))}{(n+1)!}(x - x_0)^{n+1}.$$
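
As a small illustration of Taylor's Theorem, the sketch below compares the actual error of a Taylor polynomial with the bound implied by the remainder term. The choice of $f(x) = e^x$, $x_0 = 0$, $n = 4$, and $x = 0.5$ is an assumption made for the example, as is the helper name `taylor_exp`.

```python
import math

def taylor_exp(x, n, x0=0.0):
    """Degree-n Taylor polynomial of exp about x0, evaluated at x.
    Every derivative of exp at x0 equals exp(x0)."""
    return sum(math.exp(x0) * (x - x0) ** k / math.factorial(k)
               for k in range(n + 1))

x, n = 0.5, 4
actual_error = abs(math.exp(x) - taylor_exp(x, n))

# |R_n(x)| <= max|f^(n+1)| * |x - x0|^(n+1) / (n+1)!; since exp is
# increasing, its maximum on [0, x] is exp(x).
remainder_bound = math.exp(x) * x ** (n + 1) / math.factorial(n + 1)

print(actual_error, remainder_bound)
```

The actual error should not exceed the remainder bound, because $\xi(x)$ lies somewhere between $x_0$ and x.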


Algorithm for Solving a
Tridiagonal System of Linear 
Equations 


Many of the techniques developed in this text require the solution of a linear system 
of equations with a tridiagonal coefficient matrix. These include the computation 
of a cubic spline interpolant, the solution of a one-dimensional two-point boundary 
value problem using the finite difference method and the solution of parabolic partial 
differential equations using the finite difference method. 

Suppose the system that must be solved can be written in the form 


$$\begin{bmatrix}
a_1 & b_1 & & & & \\
c_2 & a_2 & b_2 & & & \\
 & c_3 & a_3 & b_3 & & \\
 & & \ddots & \ddots & \ddots & \\
 & & & c_{n-1} & a_{n-1} & b_{n-1} \\
 & & & & c_n & a_n
\end{bmatrix}
\begin{bmatrix} w_1 \\ w_2 \\ w_3 \\ \vdots \\ w_{n-1} \\ w_n \end{bmatrix}
=
\begin{bmatrix} f_1 \\ f_2 \\ f_3 \\ \vdots \\ f_{n-1} \\ f_n \end{bmatrix}$$

or Aw = f. The solution algorithm consists of three distinct parts. First, an LU
decomposition is performed on the matrix A. This process factors the coefficient
matrix into a lower triangular matrix, L, and an upper triangular matrix, U, such
that LU = A. The LU decomposition transforms the original problem into the form
LUw = f. Let z denote the vector Uw. The solution to the original system can
now be obtained by applying forward substitution to Lz = f, followed by backward
substitution applied to Uw = z.


ALGORITHM 


*** first, the LU decomposition ***

L_1 = a_1
U_1 = b_1 / a_1
for i = 2, 3, 4, ..., n-1
    L_i = a_i - c_i U_{i-1}
    U_i = b_i / L_i
L_n = a_n - c_n U_{n-1}

*** now, the forward substitution ***

z_1 = f_1 / L_1
for i = 2, 3, 4, ..., n
    z_i = (f_i - c_i z_{i-1}) / L_i

*** finally, the backward substitution ***

w_n = z_n
for i = n-1, n-2, n-3, ..., 1
    w_i = z_i - U_i w_{i+1}


NOTE: It is possible to code this algorithm with no additional storage by storing 
each value of L in place of a, each value of U in place of b, each value of z in place 
of f, and finally each value of w in place of z. 
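
The three parts of the algorithm translate almost line for line into code. The following is a minimal sketch, assuming zero-based lists of length n in which `c[0]` and `b[n-1]` are unused placeholders; the function name `solve_tridiagonal` is illustrative. For readability it uses separate arrays rather than the in-place storage described in the note above, and it does no pivoting, so it assumes every L_i is nonzero.

```python
def solve_tridiagonal(c, a, b, f):
    """Solve A w = f for tridiagonal A with subdiagonal c, diagonal a and
    superdiagonal b (all of length n; c[0] and b[n-1] are ignored)."""
    n = len(a)
    L = [0.0] * n
    U = [0.0] * n
    z = [0.0] * n
    w = [0.0] * n

    # LU decomposition: L carries the diagonal of the lower factor,
    # U carries the superdiagonal of the unit upper factor.
    L[0] = a[0]
    for i in range(n):
        if i > 0:
            L[i] = a[i] - c[i] * U[i - 1]
        if i < n - 1:
            U[i] = b[i] / L[i]

    # Forward substitution on L z = f.
    z[0] = f[0] / L[0]
    for i in range(1, n):
        z[i] = (f[i] - c[i] * z[i - 1]) / L[i]

    # Backward substitution on U w = z.
    w[n - 1] = z[n - 1]
    for i in range(n - 2, -1, -1):
        w[i] = z[i] - U[i] * w[i + 1]
    return w

# Example: a 3 x 3 system whose exact solution is (1, 2, 3).
solution = solve_tridiagonal([0.0, -1.0, -1.0],
                             [2.0, 2.0, 2.0],
                             [-1.0, -1.0, 0.0],
                             [0.0, 0.0, 4.0])
print(solution)
```

Each loop touches every unknown once, so the work grows linearly with n, which is the main reason to prefer this specialized algorithm over general Gaussian elimination for tridiagonal systems.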
O(·), see Big-O notation, 20
A-conjugate vectors, 240
A-stability, 644 
Absolute stability, 639 
Adams fourth-order 
predictor-corrector, 591 
in variable step size algorithm, 
617 
Adams-Bashforth methods, 583 
four-step method, 585 
regions of absolute stability, 643 
three-step method, 585 
two-step method, 585 
system of equations, 625 
Adams-Moulton methods, 587 
handling implicit equation, 590 
regions of absolute stability, 643 
three-step method, 590 
trapezoidal method, 591 
two-step method, 590 
Adaptive quadrature 
generic error estimate, strategy 
1, 507 
generic error estimate, strategy 
2, 512 
Newton-Cotes vs. Gaussian, 514 
strategy 1, 506 
strategy 2, 511 
ADI scheme 
basic formulation, 868 
unconditional stability, 870 
with source and decay terms, 
874 
Aitken’s Δ²-method, 114
Algorithm, 10 
iterative, 14 
stopping condition, 15 
Approximating eigenvalues 
Hotelling deflation, 301 


inverse power method, 281 
power method, 265 
QR algorithm, 321 
reduction to tridiagonal form, 
311 
Wielandt deflation, 299 
Artificial singularity, 660 
handling in finite difference 
method, 678 
handling in shooting method, 
706 
Asymptotic error constant, 23 
Augmented matrix, 150 


Backward differentiation formulas 
absolute stability, 648 
backward Euler method, 648 
derivation, 647 
second order, 648 
third order, 648 

Backward error analysis, 185 

Backward Euler method, 648 
A-stability, 648 

Bairstow’s method, 134 

BDFs, see Backward differentiation 

formulas, 647 

Biconjugate gradient (BiCG) 

method, 246 

Big-O notation 
functions, 22, 448 
sequences, 20 

Bisection method, 59 
convergence, 61 

Boole’s rule, 458 

Boundary value problem (BVP), 

533, 656 
approximate solution, see Finite 
difference method, 660, 
see Shooting method, 660 
with artificial singularity, 660 
Brent’s method, 253 
Brown’s method, 253
Broyden’s method, 253 
BTCS method 
formulation 
basic heat equation, 810 
in two dimensions, 867 
polar coordinates, annulus, 
858 
polar coordinates, disk, 854 
with decay term, 832 
with non-Dirichlet boundary 
conditions, 842 
with source term, 830 
unconditional stability, 822, 830, 
833, 844, 855, 858 


Cauchy-Buniakowski-Schwarz 
inequality, 172 
CFL condition, 890 
Characteristics, 887 
Chebyshev polynomials, 374 
extrema, 376 
minimax property, 378 
monic, 378 
orthogonality, 385 
recurrence relation, 375 
roots, 376 
Cholesky decomposition, 216 
Composite midpoint rule, 479 
Composite Simpson’s rule, 470 
in adaptive quadrature, 506
Composite trapezoidal rule, 468 
in Romberg integration, 497 
with periodic integrands, 475 
Computational template, 662, 674, 
688, 732, 792 
Condition number, 183 
Conditioning, 37-38 
Conjugate gradient method 
finite termination in exact 
arithmetic, 245 
preconditioning, 246 
pseudocode, 243 
search direction, 240 
step size, 239 


Crank-Nicolson scheme 
formulation 
basic heat equation, 812 
in two dimensions, 867 
polar coordinates, annulus, 
858 
polar coordinates, disk, 855 
with decay term, 832 
with non-Dirichlet boundary 
conditions, 843 
with source term, 830 
unconditional stability, 824, 830, 
833, 844, 855, 858 
Crout decomposition, 205 
Cubic spline interpolant, 393 
clamped boundary conditions, 
398 
minimum curvature property, 
400 
error, 401 
natural boundary conditions, 
404 
minimum curvature property, 
405 
not-a-knot boundary conditions, 
395 


Dahlquist barrier, 648 

Deflation, 127, 296 

Degree of precision, 460 

Direct factorization, 194, 205 
Cholesky decomposition, 216 
Crout decomposition, 205 
Doolittle decomposition, 205 

Dirichlet boundary condition, 656, 

726 

Divided differences, 364 
relation to derivatives, 370, 407 

Doolittle decomposition, 205 


Eigenvalue, 177, 261 
dominant, 265 
localizing, 262 
of AT, 297 
of inverse matrix, 281 
of polynomial of matrix, 281 


Eigenvector, 177, 261 
of inverse matrix, 281 
of polynomial of matrix, 281 
of symmetric matrix, 269 
relation to eigenvector of A^T,
297 
Elementary row operations, 150 
Error 
absolute, 34 
cancellation, 44 
introduced, 43 
propagated, 43 
relative, 34 
Euler’s method, 547 
effect of roundoff error, 554 
global discretization error, 551
local truncation error, 551 
system of equations, 625 
Euler-Maclaurin sum formula, 476 
Extrapolation, 447 
in adaptive quadrature, 506 
in Richardson extrapolation, 449 
in Romberg integration, 497 
Extreme Value Theorem, A-1 


False diffusion, 892 
Fictitious node, 674, 743 
Finite difference approximations 
effect of roundoff error, 443 
error terms, 440 
first derivative 
first-order backward, 439 
first-order forward, 439 
second-order backward, 442 
second-order central, 442 
second-order forward, 441 
second derivative 
second-order central, 443 
Finite difference method for linear 
BVPs 
basic objective and process, 661 
computational grid, 661, 674 
computational template, 662, 
674 
Dirichlet boundary conditions 
matrix formulation, 663, 664
sufficient condition for unique 
solution, 663 
non-Dirichlet boundary 
conditions 
fictitious node, 674 
generic matrix formulation, 
677 
handling an artificial 
singularity, 678 
Finite difference method for 
nonlinear BVPs 
computational grid, 687 
computational template, 688 
initial vector for Newton’s 
method, 690, 693 
non-Dirichlet boundary 
conditions, 692-694 
solution of algebraic equations, 
689 
structure of Jacobian, 690 
Finite difference method for 
parabolic PDEs 
o-general scheme, 827 
ADI scheme, 867 
BTCS method, 810 
Crank-Nicolson scheme, 812 
FTCS method, 807 
matrix stability analysis, 
818-819 
semidiscrete approximation, 806 
stability, 818 
von Neumann stability analysis, 
824 
Finite difference method for Poisson 
problem 
computational grid, 731 
computational template, 732 
nonuniform grid, 792 
polar coordinates, 795 
convergence analysis of iterative 
methods, see Local mode 
analysis, 763 
asymptotic convergence rate, 
763 




Dirichlet boundary conditions, 
731-736
five-point star, 733 
nonuniform grid, 792 
polar coordinates, 795 
irregular domains 
curved boundaries, 791 
sloped boundaries, 787 
iterative solution of discrete 
equations 
Gauss-Seidel method, 758 
grid function, 757 
Jacobi method, 758 
optimal relaxation parameter, 
758 
programming hints, 765 
rationale, 757 
SOR method, 758 
lexicographic ordering, 733 
matrix structure, 736 
non-Dirichlet boundary 
conditions, 743-745, 
747-750 
fictitious node, 743, 748 
polar coordinates, 795 
red-black ordering, 742, 786 
Finite difference method for wave 
equation, 924 
produces exact solution, 
928-929 
stability, 925 
Fixed point, 83 
existence and uniqueness, 84 
Fixed point iteration, 85 
accelerating convergence, 117 
convergence, 87 
Newton's method, 95 
order of convergence, 90 
Floating point arithmetic, 42 
Floating point equivalent, 34
chopping, 34 
rounding, 34 
Floating point number system, 31 
Fourth-order Runge Kutta method, 
classical, 574 


FTCS method 
conditional stability 
basic formulation, 820, 826 
in two dimensions, 866 
polar coordinates, annulus, 
858 
polar coordinates, disk, 854 
with decay term, 833 
with non-Dirichlet boundary 
conditions, 844 
with source term, 830 
evolution matrix, 808 
formulation 
basic heat equation, 807-808 
in two dimensions, 866 
polar coordinates, annulus, 
858 
polar coordinates, disk, 853 
with decay term, 832 
with non-Dirichlet boundary 
conditions, 842 
with source term, 829 
Function norm 
l_2, 374
l_∞, 374
Functional iteration, see Fixed point 
iteration, 85 


Gauss-Jordan elimination, 156 
operation counts, 156 
Gauss-Seidel method 
component form, 228 
convergence results, 232 
in finite difference method for 
Poisson problem, 758 
iteration matrix, 229 
splitting, 229 
Gaussian elimination, 150 
operation counts, 155 
Gaussian quadrature, 482 
composite two-point formula, 
486 
conversion from standardized 
interval, 485 
standardized interval, 483 


theoretical development, 
488-492
three-point formula, 493 
two-point formula, 485 
undetermined coefficients, 483 
Generalized minimal residual 
(GMRES) method, 246 
Generalized Rolle’s Theorem, A-1 
Gerschgorin Circle Theorem, 263 


Hermite cubic interpolant, 410 
construction, 411 
error, 414 

Hermite interpolant, 405 
error, 410 
Lagrange form, 407 
Newton form, 407 
uniqueness, 406 

Heun method, 571 
as predictor-corrector scheme, 

591 
Hotelling deflation, 301 
Householder matrix, 311 


Improper integrals, 520 
discontinuous derivatives, 521 
infinite limits of integration, 527 
logarithmic discontinuities, 524 
nonremovable algebraic 

discontinuities, 524 
removable discontinuities, 522 

Initial boundary value problem, 805

Initial value problem (IVP), 533 

Initial value problem solvers 
A-stable method, 644 
absolutely stable method, 639 
Adams fourth-order 

predictor-corrector, 591 
Adams-Bashforth methods, 583 
Adams-Moulton methods, 587 
backward differentiation 

formulas, 647 
backward Euler method, 648 
classical fourth-order 

Runge-Kutta method, 574 
consistent, 540, 599
convergent, 541, 599 
Euler’s method, 547 
global discretization error, 540 
Heun method, 571 
higher-order equations, 628 
local truncation error, 540 
modified Euler method, 571 
multistep method, 539, 583 
one-step method, 538 
optimal RK2 method, 572 
order, 540, 599 
predictor-corrector schemes, 591 
RKF45, 611 
RKV56, 614 
Runge-Kutta methods, 570 
second-order backward 
differentiation formula, 648 
stable, 541, 599 
system of equations, 623 
Taylor methods, 560 
trapezoidal method, 591, 645 
variable step size Adams 
fourth-order 
predictor-corrector, 617 
variable step size algorithms, 
608 
Intermediate Value Theorem, 58, A-1 
Inverse power method 
derivation, 282 
rate of convergence, 283 
Iterative solution of linear systems 
consistent methods, 225 
convergence, 224 
iteration matrix, 223 
rate of convergence, 225 
splitting methods, 225 
uniqueness of solution, 224 


Jacobi method 
component form, 227 
convergence results, 232 
in finite difference method for 
Poisson problem, 758 
iteration matrix, 227 
splitting, 226 
Jacobian matrix, 251, 689
Jenkins-Traub method, 134 


Lagrange form of interpolating 
polynomial, 343 
advantages, 349 
disadvantages, 349 
Hermite case, 407 
Lagrange polynomials, 343 
Hermite case, 406 
Laguerre’s method, 128 
Least squares regression, 
see Regression, 418 
Legendre polynomials, 381 
monic, 383 
orthogonality, 381, 382 
recurrence relation, 381 
Lehmer-Schur algorithm, 134 
Linear systems 
direct techniques 
Cholesky decomposition, 216 
Crout decomposition, 205 
Doolittle decomposition, 205 
Gauss-Jordan elimination, 156 
Gaussian elimination, 150 
LU decomposition, 197 
iterative techniques 
conjugate gradient method, 
237 
Gauss-Seidel method, 228 
Jacobi method, 226 
SOR method, 230 
Lipschitz condition, 542 
Local mode analysis 
aliasing, 773 
convergence factor, 775 
discrete Fourier basis function, 
773 
discrete Fourier mode, 773 
error evolution equation, 771 
smooth vs. nonsmooth 
components, 772 
Lower triangular matrix, 191 
LU decomposition, 192
algorithm, 194-195 


in inverse power method, 283 
solution process, 197 


MacCormack method 
advection equation 
amplitude and phase errors, 
904 
formulation, 902-903 
modified equation, 904 
stability, 904 
systems, 906 
convection-diffusion equation 
formulation, 916, 920 
stability, 916 
Machine epsilon, 35 
Machine precision, 35 
Matrix, 142 
addition, 143 
characteristic polynomial, 177, 
261 
condition number, 183 
determinant, 145 
eigenvalue, 177 
eigenvector, 177 
elementary row operations, 150 
Householder, 311 
inverse, 145 
Jacobian, 251 
lower triangular, 191 
multiplication, 144 
orthogonal, 310 
permutation matrix, 197 
rotation, 322 
scalar multiplication, 143 
similar, 309 
similarity transformation, 309 
spectral radius, 178 
spectrum, 177, 261 
splitting, 225 
strictly diagonally dominant, 
211 
symmetric positive definite, 212 
transpose, 144 
tridiagonal, 217 
upper triangular, 150 


Matrix norm, 175 
l_1, 181
l_2, 177
l_∞ (maximum), 176
Frobenius norm, 181 
natural norm, 175 
operator norm, 175 
Mean Value Theorem, 81, A-1 
Method of characteristics, 887 
Method of False Position, 71 
accelerating convergence, 116 
Method of steepest descent, 249 
Midpoint rule, 459 
basic error term, 463 
composite formula with error 
term, 479 
Milne device, 617 
Modified equation, 892 
Modified Euler method, 571 
Monic polynomials, 377 
Multiplicity, 56
Multigrid method 
grid transfer 
bilinear interpolation, 777 
half injection, 786 
injection, 777 
two-grid method, 776-778 
V-cycle, 779 
work unit, 779 


Natural matrix norm, 175 
consistency property, 175 


Neumann boundary condition, 656, 


673, 726 
Neville’s algorithm, 356 


construction formula, 354-356 


Newton form of interpolating 
polynomial, 363 
advantages, 368 
coefficients, 364-365 
Hermite case, 407 
Newton’s method, 95 
convergence, 98, 101 
in finite difference method for 
nonlinear BVPs, 689 




in trapezoidal method, 646 
order of convergence, 100 
restoring quadratic convergence, 
120 
system of equations, 250 
Newton-Cotes quadrature, 456 
Boole’s rule, 458 
closed vs. open, 456, 464 
composite, 467 
composite midpoint rule, 479 
composite Simpson’s rule, 469 
composite trapezoidal rule, 467 
general error theorem, 464 
midpoint rule, 459 
Simpson’s rule, 458 
trapezoidal rule, 457 
Nonlinear systems 
Broyden’s method, 253 
Newton’s method, 250 
Quasi-Newton methods, 252 
Norm 
matrix, see Matrix norm, 175 
vector, see Vector norm, 171 
Numerical diffusion, 892 
Numerical dispersion, 904 
Numerical integration, 455 
adaptive quadrature, 506 
degree of precision, 460 
effect of roundoff error, 478 
Gaussian quadrature, 456, 482 
composite two-point formula, 
486 
three-point formula, 493 
two-point formula, 485 
improper integrals, 520 
Newton-Cotes quadrature, 456 
Boole’s rule, 458 
composite midpoint rule, 479 
composite Simpson’s rule, 469 
composite trapezoidal rule, 
467 
midpoint rule, 459 
Simpson’s rule, 458 
trapezoidal rule, 457 
Romberg integration, 497 
Numerical quadrature,


see Numerical integration, 
455 


Operator norm, 175 

Optimal RK2 method, 572 
Order of convergence, 23 
Orthogonal functions, 381, 488 
Orthogonal matrix, 310 
Orthogonal set of vectors, 269 
Osculatory interpolation, 405 
Overflow, 33 


Parameter estimation, 
see Regression, 418 
Partial differential equation (PDE) 
boundary conditions, 726 
elliptic, 725 
Laplace equation, 725 
Poisson equation, 725 
ellipticity equation, 725 
hyperbolic, 730, 883 
advection equation, 883 
convection equation, 883 
convection-diffusion equation, 
883 
wave equation, 883 
initial boundary value problem,
805 
parabolic, 730, 803 
heat equation, 803 
Partial pivoting, 162 
Permutation matrix, 197 
Piecewise linear interpolant, 387 
formulas for coefficients, 388 
Piecewise polynomial interpolation 
piecewise cubic 
clamped cubic spline, 398 
cubic spline, 393 
Hermite cubic, 410 
natural cubic spline, 404 
not-a-knot cubic spline, 395 
piecewise linear, 387 
Polynomial interpolation 
divided differences, 364 


error, 348 
Hermite interpolation, 405 
Lagrange form, 343 
linear interpolation, 341 
Neville’s algorithm, 353 
Newton form, 363 
optimal points, l_2-norm, 383
optimal points, l_∞-norm, 379,
380 
osculatory interpolation, 405 
uniqueness, 347 
Power method 
basic assumptions, 265 
general matrix 
derivation, 265 
rate of convergence, 266 
stopping condition, 267 
symmetric matrix 
derivation, 269 
rate of convergence, 270 
stopping criterion, 271 
Predictor-Corrector schemes, 591 
Adams fourth-order, 591 
in variable step size algorithms, 
616 
modified Euler method, 591 


QR algorithm, 321 
obtaining eigenvectors, 330 
pseudocode, 329 
Wilkinson shift, 328 
Quadrature, see Numerical 
integration, 455 
Quasi-minimal residual (QMR) 
method, 246 
Quasi-Newton methods, 252 


Rate of convergence, 20, 22 
Reduction to tridiagonal form 
basic Householder process, 312 
calculating eigenvectors, 316 
Lanczos method, 319 
selection of Householder matrix, 
313 


Region of absolute stability 
multistep method, 642 
one-step method, 640 

Regression 
exponential fit, 423 
least squares criterion, 419 
least squares regression line, 419 
logarithmic fit, 422 
power law fit, 422 
quadratic fit, 428 

Residual vector, 181 

Richardson extrapolation, 448 
extrapolation table, 452 
generic formula, 452 
notation, 450 

RKF45, 611

RKV56, 614 

Robin boundary condition, 656, 673, 
Rolle’s Theorem, A-1

Romberg integration, 497 
efficient calculation, 498 
error estimate, 499 
generic extrapolation formula, 

497 
notation, 497 

Rootfinding methods 
bisection method, 59 
Laguerre’s method, 128 
method of false position, 71 
Newton’s method, 95 
secant method, 107 

Rootfinding problem, 54 

Rotation matrix, 322 
postmultiplication, 326 
premultiplication, 323 

Roundoff error, 34 
accumulation, 43 
effect on Euler’s method, 554 
effect on finite difference 

approximations, 443 
effect on numerical integration, 
478 

Runge-Kutta methods, 570 

embedded pairs, 611 
fourth-order
classical fourth-order method, 
574 
implicit methods, 651 
regions of absolute stability, 640 
second-order 
Heun method, 571 
modified Euler method, 571 
optimal RK2 method, 572 
variable step size 
RKF45, 611 
RKV56, 614 


Scaled partial pivoting, 165 
Secant method, 107 
in shooting method for 
nonlinear BVPs, 714 
in trapezoidal method, 646 
order of convergence, 109 
Sherman-Morrison formula, 253 
Shock, 894 
Shooting method 
handling an artificial singularity, 
706 
linear BVPs 
Dirichlet boundary 
conditions, 699-700
non-Dirichlet boundary 
conditions, 702-703 
nonlinear BVPs 
conversion to rootfinding 
problem, 713, 715 
notation, 713 
objective function, 713, 716 
Robin condition at left 
endpoint, 717 
solution of rootfinding 
problem, 713 
reversing integration direction, 
707, 718 
Significant digits, 36 
Similar matrices, 309 
Similarity transformation, 309 
Simple enclosure methods, 58 
bisection method, 59 
method of false position, 71 
Simpson’s rule, 458
basic error term, 462 
composite formula with error 
term, 470 
Simultaneous relaxation, see Jacobi 
method, 227 
Small-O notation, 448 
SOR method 
component form, 230 
convergence results, 232 
in finite difference method for 
Poisson problem, 758 
iteration matrix, 231 
optimal relaxation parameter, 
234 
splitting, 230 
Spectral radius, 178 
relation to convergence of 
iterative techniques, 224 
relation to matrix norm, 178 
relation to rate of convergence 
for iterative techniques, 233 
Spectrum, 261 
Squared conjugate gradient (CGS) 
method, 246 
Stabilized biconjugate gradient 
(BiCGSTAB) method, 246
Steffensen’s method, 118 
Stiff differential equation, 643 
Stiffness ratio, 644 
Strictly diagonally dominant 
matrices, 211 
convergence of Gauss-Seidel 
method, 232 
convergence of Jacobi method, 
232 
in cubic spline interpolation, 
396, 399 
relation to solution of linear 
systems, 211 
Successive over-relaxation, see SOR 
method, 232 
Successive relaxation, 
see Gauss-Seidel method, 
229 


Symmetric positive definite matrices, 
212 
Cholesky decomposition, 216 
convergence of Gauss-Seidel 
method, 233 
convergence of SOR method, 
233 
relation to conjugate gradient 
method, 237 
relation to solution of linear 
systems, 215 
Synthetic division, 127 
modified version, 363 


Taylor methods, 560-565 
Taylor’s theorem, 25, A-2 
in two variables, 571 
Trapezoidal method (IVP solver), 
591, 645 
A-stability, 645 
region of absolute stability, 645 
solving implicit equation, 645 
Trapezoidal rule, 457 
basic error term, 461 
composite formula with error, 
468 
Tridiagonal matrices, 217 
decomposition algorithm, 217, 
A-3 
in BTCS method, 810 
in Crank-Nicolson scheme, 812 
in cubic spline interpolation, 395 
in finite difference method, 662,
690 


Underflow, 33 
Upper triangular matrix, 150 
Upwind scheme 
advection equation 
amplitude and phase errors, 
891 
CFL condition, 891 
finite difference equation, 889 
modified equation, 892 
systems, 895 
von Neumann stability 
analysis, 891 
convection-diffusion equation 
finite difference equation, 915, 
920 
modified equation, 915 
stability, 915 


Variable step size algorithms 
multistep methods 
Adams fourth-order 
predictor-corrector, 617 
estimating truncation error, 
615 
Milne device, 617 
step size control, 617 
one-step methods 
estimating truncation error, 
609 
RKF45, 611 
RKV56, 614 
step size control, 610 
Vector norm, 171 
l_1, 180 
l_2 (Euclidean), 171, 269 
l_∞ (maximum), 171 
convergence, 174 
equivalence, 174 
von Neumann stability analysis, 824 
amplification factor, 824 
VS.PC4, see Adams fourth-order 
predictor-corrector, variable 
step size algorithm, 617 


Weierstrass Approximation 
Theorem, 339 
Weight function, 381, 488 
Weighted Mean-Value Theorem for 
Integrals, 460, A-1 
Wielandt deflation 
basic transformation, 297 
deflation vector, 299 
problem size reduction, 299 
“I am extremely impressed with Bradie’s book. His passion for explaining 
things as clearly and understandably as possible, his thorough research of the 
literature for bringing relevant and pedagogically sound examples from outside 
mathematics, and his crisp and clear style will certainly make this text an instant 
success. This is one of the better texts in Numerical Analysis that I have ever 
seen, and I congratulate the author for producing such a gem.” 
“The chapters in this book are of uniformly high standards. Chapter 1 in particular 
is a gem. The treatments of floating point number systems and of floating point 
arithmetic are especially good. These are topics that are often glossed over in 
other books, and which are often difficult for students to grasp. The book is 
extremely well written: the style is clear, the prose flows smoothly, the pace is 
unhurried, the tone is friendly and conversational, the examples and exercises 
are interesting and relevant, and the amount of detail is far greater than in any 
textbook of its kind that I have ever seen. For these reasons, it will certainly 
appeal to my students.” 


“I think the tone will appeal to my students: It is relaxed and friendly without being 
wordy and effusive. The style is a very readable compromise between proof and 
technical detail on the one hand, and concepts with applications on the other. 
I think he addresses this fundamental challenge in a way that my students would 
like. Bradie has decided to include lots of worked examples accompanied by 
plots. The plots facilitate the inclusion of such a large number of examples, by 
succinctly communicating the point of each. This reduces the effort needed to 
understand the ideas behind the example. (I think students simply will not read 
the book if it takes too much effort. Bradie can include more exercises than is 
typical because the illustrations ease the communication.)” 

Mark Arnold 


“I like the way Bradie presents the materials in each chapter. He gives a 
mathematics review on what is needed at the beginning of each chapter. 
After refreshing students’ memories, he begins with the simplest, most basic 
methods and then progresses gradually to more advanced topics. The book 
is well written and student-friendly. It provides a lot of examples and exercise 
problems. The book is written in the way that is easy for students to read. 
For instance, for each method, there is at least one fully worked example that 
helps students to understand the concept and the method.” 
